Techniques to provide significance for statistical tests

ABSTRACT

Techniques to provide significance for statistical tests are described. An apparatus may comprise a data handler component to receive a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon, a statistical test component to receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, the statistics of the statistical test to follow a probability distribution, generate statistics for the statistical test using the real data set, generate the approximate probability distribution of the computational representation, and a significance generator component to generate a set of statistical significance values for the statistics through interpolation using the approximate probability distribution, the set of statistical significance values comprising one or more p-values. Other embodiments are described and claimed.

RELATED CASES

This application is a continuation of U.S. patent application Ser. No. 14/270,662 titled “TECHNIQUES TO SIMULATE STATISTICAL TESTS” filed on May 6, 2014, which is hereby incorporated by reference in its entirety.

BACKGROUND

In some cases, a computer system may be used to perform statistical tests. The decision to use a computer system is normally a function of, in part, the size of the data set needed to perform a given statistical test. Even a moderately complex statistical test may require a massive data set, sometimes on the order of terabytes for example, to produce sufficiently accurate results.

SUMMARY

The following presents a simplified summary in order to provide a basic understanding of some novel embodiments described herein. This summary is not an extensive overview, and it is not intended to identify key/critical elements or to delineate the scope thereof. One purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

Various embodiments are generally directed to techniques to perform automated statistical testing. Some embodiments are particularly directed to techniques to determine statistical significance of test results from a statistical test using a distributed processing system. In one embodiment, for example, an apparatus may comprise processor circuitry and a data handler component operative on the processor circuitry to receive a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon. The apparatus may further comprise a statistical test component operative on the processor circuitry to receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, the statistics of the statistical test to follow a probability distribution, generate statistics for the statistical test using the real data set, and generate the approximate probability distribution of the computational representation. The apparatus may further comprise a significance generator component operative on the processor circuitry to generate a set of statistical significance values for the statistics through interpolation using the approximate probability distribution, the set of statistical significance values comprising one or more p-values, each p-value to represent a probability of obtaining a given test statistic from the real data set. Other embodiments are described and claimed.

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings. These aspects are indicative of the various ways in which the principles disclosed herein can be practiced and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. Other features will become apparent from the following detailed description when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of an automated statistical test system.

FIG. 2 illustrates an example of a first logic flow for a simulation subsystem.

FIG. 3 illustrates an example of a first operational environment.

FIG. 4 illustrates an example of a second operational environment.

FIG. 5 illustrates an example of a third operational environment.

FIG. 6 illustrates an example of a computing system.

FIG. 7 illustrates an example of a distributed computing system.

FIG. 8 illustrates an example of a second logic flow for a simulation subsystem.

FIG. 9 illustrates an example of a third logic flow for a simulation subsystem.

FIG. 10 illustrates an example of a fourth logic flow for a simulation subsystem.

FIG. 11 illustrates an example of a fifth logic flow for a simulation subsystem.

FIG. 12 illustrates an example of a sixth logic flow for a simulation subsystem.

FIG. 13 illustrates an example of a first simulated data structure.

FIG. 14 illustrates an example of a fourth operational environment.

FIG. 15 illustrates an example of a second simulated data structure.

FIG. 16 illustrates an example of a fifth operational environment.

FIG. 17 illustrates an example of a third simulated data structure.

FIG. 18 illustrates an example of a seventh logic flow for a simulation subsystem.

FIG. 19 illustrates an example of a sixth operational environment.

FIG. 20 illustrates an example of an eighth logic flow for a simulation subsystem.

FIG. 21A illustrates an example of a seventh operational environment.

FIG. 21B illustrates an example of a ninth logic flow for a simulation subsystem.

FIG. 22 illustrates an example of an eighth operational environment.

FIG. 23 illustrates an example of a tenth logic flow for a simulation subsystem.

FIG. 24 illustrates an example of an eleventh logic flow for a simulation subsystem.

FIG. 25 illustrates an example of a twelfth logic flow for a simulation subsystem.

FIG. 26 illustrates an example of a thirteenth logic flow for a simulation subsystem.

FIG. 27 illustrates an example of a fourteenth logic flow for a simulation subsystem.

FIG. 28A illustrates an example of a statistical test subsystem.

FIG. 28B illustrates an example of a user interface view for a statistical test subsystem.

FIG. 29 illustrates an example of a logic flow for a statistical test subsystem.

FIG. 30 illustrates an example of a centralized system.

FIG. 31 illustrates an example of a distributed system.

FIG. 32 illustrates an example of a computing architecture.

FIG. 33 illustrates an example of a communications architecture.

FIG. 34 illustrates an example of an article of manufacture.

DETAILED DESCRIPTION

In statistics, a result is considered statistically significant if, for example, it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, referred to as a significance level. A statistical test is used in determining what outcomes of a study would lead to a rejection of a null hypothesis for a pre-specified level of significance. A null hypothesis refers to a default position, such as there is no relationship between two measured phenomena, for example, that a potential medical treatment has no effect. Statistical significance is instructive in determining whether results contain enough information to cast doubt on the null hypothesis.

Various embodiments described and shown herein are generally directed to techniques to perform enhanced automated statistical testing. Some embodiments are particularly directed to an automated statistical test system arranged to determine statistical significance of test results from a statistical test. In one embodiment, for example, the automated statistical test system may include a simulation subsystem and a statistical test subsystem. The simulation subsystem may, among other features, generate an approximate probability distribution for the statistics of a statistical test. The statistical test subsystem may, among other features, generate statistical significance values for results of a statistical test using an approximate probability distribution. Embodiments are not limited to these subsystems.

With general reference to notations and nomenclature used herein, the detailed descriptions which follow may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art.

A procedure is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. These operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic or optical information capable of being stored, transferred, combined, compared, and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to this “information” as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities.

Further, the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein which form part of one or more embodiments. Rather, the operations are machine operations. Useful machines for performing operations of various embodiments include general purpose digital computers or similar devices.

Various embodiments also relate to apparatus or systems for performing these operations. This apparatus may be specially constructed for the required purpose or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer. The procedures presented herein are not inherently related to a particular computer or other apparatus. Various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given.

Reference is now made to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the novel embodiments can be practiced without these specific details. In other instances, well known structures and devices are shown in block diagram form in order to facilitate a description thereof. The intention is to cover all modifications, equivalents, and alternatives consistent with the claimed subject matter.

FIG. 1 illustrates a block diagram for an automated statistical test system 100. In one embodiment, the automated statistical test system 100 may be implemented as a computer system having a simulation subsystem 120 and a statistical test subsystem 140. The subsystems 120, 140 may each be implemented as a separate or integrated software application comprising one or more components, such as components 122-a as shown for the simulation subsystem 120 in FIG. 1. Although the automated statistical test system 100 shown in FIG. 1 has a limited number of elements in a certain topology, it may be appreciated that the automated statistical test system 100 may include more or fewer elements in alternate topologies as desired for a given implementation.

It is worthy to note that “a” and “b” and “c” and similar designators as used herein are intended to be variables representing any positive integer. Thus, for example, if an implementation sets a value for a=4, then a complete set of components 122-a may include components 122-1, 122-2, 122-3 and 122-4. The embodiments are not limited in this context.

In various embodiments, portions of the automated statistical test system 100 may be implemented as software components comprising computer executable instructions implemented in a given programming language. In one embodiment, for example, the computer executable instructions may be implemented in a specific programming language as developed by SAS® Institute, Inc., Cary, N.C. For instance, the computer executable instructions may be implemented in a procedure referred to herein as HPSIMULATE, which is a procedure suitable for execution within a SAS programming language and computing environment. In such embodiments, the computer executable instructions may follow syntax and semantics associated with HPSIMULATE, as described in more detail with reference to FIG. 34, infra. However, embodiments are not limited to HPSIMULATE, and further, do not need to necessarily follow the syntax and semantics associated with HPSIMULATE. Embodiments are not limited to a particular type of programming language.

As shown in FIG. 1, the automated statistical test system 100 may include two subsystems, a simulation subsystem 120 and a statistical test subsystem 140. The simulation subsystem 120 may generate a computational representation 130 arranged to generate an approximate probability distribution 132 for a statistical test 114. The statistical test subsystem 140 may generate statistical significance values for results of the statistical test 114 using an approximate probability distribution 132 generated by the computational representation 130.

The simulation subsystem 120 may be generally arranged to perform a statistical simulation for a variety of statistical tests 114. The statistical test 114 may include any known statistical test as represented by the statistical test function 112. Some examples for the statistical test 114 may include without limitation median test, mode test, R test, means test, t-test for single means, independent t-test, dependent t-test, Wald-Wolfowitz runs test, Kolmogorov-Smirnov test, Mann-Whitney U test, sign test, Wilcoxon matched pairs test, alternative to one-way between-groups analysis of variance (ANOVA) test, one-way ANOVA test, Kruskal-Wallis ANOVA test, repeated measures ANOVA test, Friedman ANOVA test, Kendall Concordance test, Pearson product moment correlation test, Spearman correlation test, linear regression test, data mining decision tree tests, neural network tests, nonlinear estimation test, discriminant analysis test, predictor importance test, KPSS unit root test, Shin cointegration test, ERS unit root test, Bai and Perron's multiple structural change tests (e.g., maxF, UDmaxF, WDmaxF, supF_(l+1|l), etc.), Im, Pesaran and Shin (2003) panel unit root test, Bhargava, Franzini and Narendranathan (1982) test, generalized Durbin-Watson statistics, generalized Berenblut-Webb statistics for first-order correlation in a fixed effects model, Gourieroux, Holly and Monfort (1982) test for random effects (two way), Johansen's cointegration rank test, and many others. Embodiments are not limited in this context.

The simulation subsystem 120 may be arranged to generate an approximate probability distribution, probability distribution function, or distribution function (collectively referred to herein as an “approximate probability distribution”) for the statistics of a statistical test 114. A probability distribution assigns a probability to each measurable subset of possible outcomes of a random experiment, survey, or procedure of statistical inference. A probability distribution can either be univariate or multivariate. A univariate distribution gives the probabilities of a single random variable taking on various alternative values. A multivariate distribution gives probabilities of a random vector (e.g., a set of two or more random variables) taking on various combinations of values.

More particularly, a statistical test 114 is normally based on a “test statistic.” In statistical hypothesis testing, a hypothesis test is typically specified in terms of a test statistic, which is a function of the sample. A test statistic is considered as a numerical summary of a data-set that reduces the data to one value that can be used to perform a hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviors that would distinguish the null from the alternative hypothesis where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.

An important property of a test statistic is that its sampling distribution under the null hypothesis must be calculable, either exactly or approximately, which allows p-values to be calculated. A test statistic is a function of associated data and a model. Under the assumptions of a null hypothesis and the model, the test statistic has an associated “sampling distribution.” A sampling distribution refers to a probability distribution for values of the test statistic over hypothetical repeated random samples of the data, for random data samples having the probability distribution assumed for the data by the model and null hypothesis.

In one embodiment, for example, the simulation subsystem 120 attempts to determine and approximate a sampling distribution of a test statistic under an assumed null hypothesis to generate an approximate probability distribution. The simulation subsystem 120 determines an approximate probability distribution for a given set of statistics of a statistical test 114. It is worthy to note that in some embodiments when an approximate probability distribution is said to be associated with a given statistical test 114, it implies that the approximate probability distribution is associated with a set of statistics for the statistical test 114 rather than the statistical test 114 alone.

In various embodiments, a probability distribution may have a “known form” and/or an “unknown form.” A probability distribution of a “known form” means that the analytical formula of the cumulative distribution function (CDF) of the distribution can be efficiently computed; for example, the CDF is a closed-form expression, or the CDF can be well approximated by a numerical method. A probability distribution of an “unknown form” means that the analytical formula of the CDF of the distribution is unavailable, or cannot be efficiently computed or approximated by any known numerical method. Accordingly, the probability distribution of an “unknown form” is to be evaluated through simulation.

In various embodiments, the simulation subsystem 120 may be arranged to generate a probability distribution for the statistics of a given statistical test having a known form and/or an unknown form. In one embodiment, for example, a probability distribution for the statistics of a given statistical test 114 is a known form, such as a Gaussian distribution, a log-normal distribution, a discrete uniform distribution, a continuous uniform distribution, and many others. However, the statistics of some statistical tests 114 may follow a probability distribution of unknown form. In such cases, a probability distribution of unknown form may be approximated through empirical measure. An empirical measure is a random measure arising from a particular realization of a (usually finite) sequence of random variables. As such, in another embodiment, the simulation subsystem 120 may generate an approximate probability distribution 132 for the statistics of a given statistical test 114 where a probability distribution for the statistics of the statistical test is an unknown form. This may be particularly useful in those cases where the statistics of a statistical test 114 follow a probability distribution for which no known mathematical formula is available to compute its values and which therefore can only be evaluated through simulation.

The simulation subsystem 120 may receive as input a simulated data function 110 arranged to generate simulated data for a given statistical test 114. The simulation subsystem 120 may further receive as input a statistical test function 112 arranged to perform the statistical test 114. The simulation subsystem 120 may execute the simulated data function 110 to generate simulated data for the statistical test 114, and the statistical test function 112 to simulate statistics from the simulated data, and create a computational representation 130 to generate an approximate probability distribution 132 from the simulated statistics. The computational representation 130 may, for example, be used by another software program at some future time to perform an actual statistical test 114, such as a statistical test subsystem 140. The statistical test subsystem 140 may, for example, perform the statistical test 114 on actual data sets (e.g., organization data, business data, enterprise data, etc.), and generate statistical significance values utilizing one or more approximate probability distributions 132 generated by the computational representation 130.
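The flow just described, in which a data function and a test function are supplied as inputs and a reusable representation is produced as output, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HPSIMULATE procedure: the function names, the choice of standard normal data, the one-sample t statistic, and the step-function empirical CDF are all hypothetical stand-ins chosen for brevity.

```python
import bisect
import math
import random
from typing import Callable, List, Sequence


def simulated_data_function(rng: random.Random, n_obs: int) -> List[float]:
    """Hypothetical stand-in for simulated data function 110: data drawn under the null."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]


def statistical_test_function(data: Sequence[float]) -> float:
    """Hypothetical stand-in for statistical test function 112: a one-sample t statistic."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((v - mean) ** 2 for v in data) / (n - 1)
    return mean / math.sqrt(variance / n)


def build_computational_representation(
    data_fn: Callable[[random.Random, int], List[float]],
    test_fn: Callable[[Sequence[float]], float],
    n_obs: int,
    n_sims: int,
    seed: int = 0,
) -> Callable[[float], float]:
    """Simulate statistics, then return a callable that evaluates their empirical CDF."""
    rng = random.Random(seed)
    stats = sorted(test_fn(data_fn(rng, n_obs)) for _ in range(n_sims))

    def approximate_cdf(x: float) -> float:
        # Fraction of simulated statistics that are <= x.
        return bisect.bisect_right(stats, x) / len(stats)

    return approximate_cdf


approx_cdf = build_computational_representation(
    simulated_data_function, statistical_test_function, n_obs=30, n_sims=5000
)
observed_statistic = 2.1  # hypothetical statistic computed from a real data set
p_value = 1.0 - approx_cdf(observed_statistic)  # upper-tail p-value
```

Because the two functions are passed in as arguments, either one can be swapped without touching the simulation loop, which mirrors the library-based arrangement described for the simulated data function 110 and the statistical test function 112.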

Examples for an approximate probability distribution 132 may include without limitation an empirical distribution function or empirical CDF. An empirical CDF is a cumulative distribution function associated with an empirical measure of a sample. The simulation subsystem 120 may generate other approximate probability distributions 132 as well using the techniques described herein. The embodiments are not limited in this context.
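For reference, the standard textbook definition of an empirical CDF for a sample $X_1, \ldots, X_N$ may be written as shown here. The symbol $\hat{F}_N$ and the indicator notation $\mathbf{1}\{\cdot\}$ are introduced only for this illustration:

$\hat{F}_N(x) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{X_i \le x\}$

That is, $\hat{F}_N(x)$ is the fraction of sample values that do not exceed $x$, so it rises in steps of $1/N$ from 0 to 1 as $x$ increases.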

The simulation subsystem 120 may generate an approximate probability distribution 132 for the statistics of a statistical test 114 where an actual probability distribution for the statistics of the statistical test 114 is of a known or unknown form. For example, when a statistical test 114 has a probability distribution of a known form, the approximate probability distribution 132 may be useful to evaluate or refine the known probability function. In another example, when the statistics of a statistical test 114 follow a probability distribution of an unknown form, the approximate probability distribution 132 may be useful to generate statistical significance values for a statistical test 114. The latter example may be particularly useful in those cases where a statistical test 114 has a level of complexity that makes manual estimation of an approximate probability distribution 132 untenable.

The simulation subsystem 120 may comprise a simulated data component 122-1. The simulated data component 122-1 may be generally arranged to generate simulated data for a statistical test 114 utilizing the simulated data function 110. The simulated data function 110 may be stored as part of a software library. In this way, the simulated data component 122-1 may generate many different types of simulated data for a given statistical test 114, without having to alter or modify instructions for the simulated data component 122-1. Alternatively, the simulated data function 110 may be integrated with the simulated data component 122-1. The simulated data component 122-1 may be described in more detail with reference to FIG. 3, infra.

The simulation subsystem 120 may comprise a statistic simulator component 122-2. The statistic simulator component 122-2 may be generally arranged to simulate statistics for the statistical test 114 from the simulated data utilizing the statistical test function 112. As with the simulated data function 110, the statistical test function 112 may be stored as part of a software library. In this way, the statistic simulator component 122-2 may simulate many different types of statistical tests 114 with a given set of simulated data, without having to alter or modify instructions for the statistic simulator component 122-2. Alternatively, the statistical test function 112 may be integrated with the statistic simulator component 122-2. The statistic simulator component 122-2 may be described in more detail with reference to FIG. 4, infra.

The simulated data function 110 and the statistical test function 112 may be dependent or independent with respect to each other. In one embodiment, the simulated data function 110 and the statistical test function 112 may be complementary, where a simulated data set is specifically tuned for a given statistical test 114. In another embodiment, the simulated data function 110 and the statistical test function 112 may be independently designed.

The statistic simulator component 122-2 may include a simulation control engine 124. In one embodiment, the simulation control engine 124 may be generally arranged to control simulation operations across a distributed computing system. A distributed computing system may comprise, for example, multiple nodes each having one or more processors capable of executing multiple threads, as described in more detail with reference to FIG. 6, infra.

The use of a distributed computing system to generate simulated statistics may be useful for statistical tests 114 that need a larger data set. While simulating a statistic for one specific parameter vector may be relatively easy, simulating statistics for all possible parameter vectors could be computationally intensive. As such, a distributed computing system may reduce simulation time.

The simulation control engine 124 may distribute portions of simulated data or simulated statistics across multiple nodes of the distributed computing system in accordance with a column-wise or a column-wise-by-group distribution algorithm, for example. The use of a distributed computing system in general, and the column-wise or column-wise-by-group distribution algorithm in particular, substantially reduces an amount of time needed to perform the simulation. In some cases, an amount of time needed to perform a simulation may be reduced by several orders of magnitude (e.g., years to days or hours), particularly with larger data sets (e.g., terabytes) needed for even moderately complex statistical tests. The simulation control engine 124 may be described in more detail with reference to FIG. 5, infra.

The simulation subsystem 120 may comprise a code generator component 122-3. The code generator component 122-3 may be generally arranged to create a computational representation 130. The computational representation 130 may be arranged to generate an approximate probability distribution 132 for the statistics of a statistical test 114 on a parameter vector from the simulated statistics. The code generator component 122-3 may be described in more detail with reference to FIG. 19, infra.

The computational representation 130 may be created as any software component suitable for execution by a processor circuit. Examples for the computational representation 130 may include without limitation a function, procedure, method, object, source code, object code, assembly code, binary executable file format, simple executable (COM) file, executable file (EXE), portable executable (PE) file, new executable (NE) file, a dynamic-link library (DLL), linear executable (LX) file, mixed linear executable (LE) file, a collection of LE files (W3) file, a compressed collection of LE files (W4) file, or other suitable software structures. The computational representation 130 may be generated in any computer programming language. Embodiments are not limited in this context.

The simulation subsystem 120 may comprise an evaluation component 122-4. The evaluation component 122-4 may be generally arranged to evaluate a computational representation 130 for performance. For instance, the evaluation component 122-4 may receive a computational representation 130 arranged to generate an approximate probability distribution 132 for the statistics of the statistical test 114 on a parameter vector from the simulated statistics. The computational representation 130 may include a simulated data structure with information for one or more estimated CDF curves. The evaluation component 122-4 may perform at least two kinds of evaluations on the computational representation 130.

A first type of evaluation is a performance evaluation. The performance evaluation attempts to determine whether the computational representation 130 performs according to a defined set of criteria. If the computational representation 130 does not meet one or more of the defined set of criteria, the evaluation component 122-4 may determine whether points should be added to the simulated data structure to improve performance of the computational representation 130.

A second type of evaluation is a reduction evaluation. As with the performance evaluation, the reduction evaluation may attempt to determine whether the computational representation 130 performs according to a defined set of criteria. If the computational representation 130 does meet one or more of the defined set of criteria, the evaluation component 122-4 may further determine whether points can be removed from the simulated data structure to give a same or similar level of performance. Removing points from the simulated data structure may reduce a data storage size for the simulated data structure, and a data storage size for a corresponding computational representation 130 having the reduced simulated data structure.

When reduction is possible, the evaluation component 122-4 may attempt to reduce a data storage size for a computational representation 130. The evaluation component 122-4 may evaluate the simulated data structure to determine whether any points in the grid of points are removable from the simulated data structure given a target level of precision. The evaluation component 122-4 may reduce the simulated data structure in accordance with the evaluation to produce a reduced simulated data structure, the reduced simulated data structure to reduce a data storage size for the computational representation 130. In some cases, the reduced simulated data structure may be obtained by lowering a level of precision for the reduced simulated data structure relative to the original simulated data structure. The evaluation component 122-4 may be described in more detail with reference to FIG. 22, infra.
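One way to picture a reduction evaluation is as a thinning pass that drops a stored point whenever the remaining points can still reproduce it within a target precision. The sketch below is only an assumption about how such a pass might look, since the reduction algorithm is not specified at this point in the description. As a simplification, it treats the points as knots of a single estimated CDF curve, uses piecewise-linear interpolation, and applies a greedy left-to-right rule; the names `interpolate` and `reduce_points` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (statistic value, estimated CDF value)


def interpolate(points: List[Point], x: float) -> float:
    """Piecewise-linear interpolation through points sorted by strictly increasing x."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("points must be sorted by x")


def reduce_points(points: List[Point], tolerance: float) -> List[Point]:
    """Greedily drop interior points while every original point stays within tolerance."""
    kept = list(points)
    i = 1
    while i < len(kept) - 1:
        candidate = kept[:i] + kept[i + 1:]
        # Worst interpolation error over all ORIGINAL points if kept[i] is removed.
        worst = max(abs(interpolate(candidate, x) - y) for x, y in points)
        if worst <= tolerance:
            kept = candidate  # removal is acceptable; re-check the new kept[i]
        else:
            i += 1  # this point is needed; move on
    return kept


straight = [(0.0, 0.0), (1.0, 0.25), (2.0, 0.5), (3.0, 0.75), (4.0, 1.0)]
curved = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.5), (3.0, 0.9), (4.0, 1.0)]
reduced_straight = reduce_points(straight, tolerance=1e-9)
reduced_curved = reduce_points(curved, tolerance=0.01)
```

In this example the points on a straight line collapse to the two endpoints, while the curved example loses only its middle point. Raising `tolerance` removes more points, which corresponds to trading precision for a smaller data storage size.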

Included herein is a set of flow charts representative of exemplary methodologies for performing novel aspects of the disclosed architecture. While, for purposes of simplicity of explanation, the one or more methodologies shown herein, for example, in the form of a flow chart or flow diagram, are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.

FIG. 2 illustrates one example of a logic flow 200. The logic flow 200 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation subsystem 120 of the automated statistical test system 100.

In the illustrated embodiment shown in FIG. 2, the logic flow 200 may generate simulated data for a statistical test, the statistics of the statistical test based on parameter vectors to follow a probability distribution of a known or unknown form at block 202. For example, the simulated data component 122-1 may generate simulated data for a statistical test 114, while the statistics of the statistical test 114 based on parameter vectors follow a probability distribution of a known or unknown form. The simulated data component 122-1 may generate the simulated data with a simulated data function 110. In one embodiment, for example, the simulated data function 110 may be designed to generate simulated data for a multiple structural change (maxF) test.

The logic flow 200 may simulate statistics for the parameter vectors from the simulated data, each parameter vector to be represented with a single point in a grid of points at block 204. For example, the statistic simulator component 122-2 may receive simulated data from the simulated data component 122-1, and simulate statistics for a statistical test 114 with a statistical test function 112. In one embodiment, for example, the statistical test function 112 may be designed to implement a multiple structural change (maxF) test.

The statistic simulator component 122-2 may simulate statistics for one or more parameter vectors of the statistical test, each parameter vector to comprise a single point in a grid of points. The statistic simulator component 122-2 may simulate statistics for all given parameter vectors (p) for a statistical test (T) from the simulated data. The statistics of the statistical test T based on a given parameter vector p follow some probability distribution (D). The simulation subsystem 120 may approximate D with simulation. For any given parameter vector p, the statistic simulator component 122-2 can randomly draw a sample $X = \{X_i\}_{i=1}^{N}$ from D and construct an approximate probability distribution 132 in the form of an empirical CDF $\tilde{T}(p, x)$. The empirical CDF $\tilde{T}(p, x)$ may have a level of precision as measured by a Kolmogorov-Smirnov statistic shown in Equation (1) as follows:

$\begin{matrix}{\sqrt{N}\,\sup\limits_{x}\left| {\tilde{T}\left( {p,x} \right)} - {T\left( {p,x} \right)} \right| \sim K} & {{Equation}\mspace{14mu} (1)}\end{matrix}$

where T(p, x) represents a true unknown CDF, and distribution K is a Kolmogorov distribution, for which a table of the distribution shows K(3) of almost 1. In accordance with Equation (1), the empirical CDF $\tilde{T}(p, x)$ may have a precision of approximately $1/\sqrt{N}$ and in almost all cases below $3/\sqrt{N}$, where N is the sample size, or the number of simulated statistics, for the given parameter vector p. For example, when N=1,000,000, the precision is about 0.001.
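The precision bound of Equation (1) may be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the simulated statistics follow a Uniform(0, 1) distribution so that the true CDF is known; the function names `empirical_cdf` and `ks_distance` are hypothetical and are not part of the described system.

```python
import bisect
import math
import random


def empirical_cdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def ks_distance(sample, true_cdf):
    """sup over x of |empirical CDF - true CDF|, checked at every jump."""
    ordered = sorted(sample)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        t = true_cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x.
        worst = max(worst, abs((i + 1) / n - t), abs(i / n - t))
    return worst


rng = random.Random(12345)                     # fixed seed for repeatability
N = 10_000                                     # number of simulated statistics
sample = [rng.random() for _ in range(N)]      # stand-in for simulated statistics
d = ks_distance(sample, lambda x: min(max(x, 0.0), 1.0))
print(f"sup distance = {d:.5f}, 1/sqrt(N) = {1 / math.sqrt(N):.5f}, "
      f"3/sqrt(N) = {3 / math.sqrt(N):.5f}")
```

Increasing N by a factor of 100 shrinks the bound by a factor of 10, which is why three-digit precision calls for on the order of 1,000,000 simulated statistics per point.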

As the statistic simulator component 122-2 may utilize various interpolation techniques to generate approximate probability distributions 132 for one or more parameter vectors for a statistical test 114, each parameter vector may be referred to as a “point” in a grid of points (M) used for interpolation. In this context, for example, the term “point” is a mathematical point within a defined problem space. In one embodiment, for instance, the problem space may comprise a “parameter space” for a statistical test 114, with the parameter space made up of a given set of parameter vectors for the statistical test 114. In other words, a specific value of a parameter vector is a point in the “parameter space” of a mathematical problem. If elements of one or more parameter vectors (e.g., the parameters of the problem) are plotted on Cartesian coordinates, then the parameter vector may be mapped to a point on a graph in a conventional manner.

The logic flow 200 generates quantiles for each point in the grid of points at block 208. For example, the statistic simulator component 122-2 may generate quantiles for each point in the grid of points. Quantiles may refer to data values taken at regular intervals from the cumulative distribution function (CDF) of a random variable. The data values may mark boundaries between consecutive data subsets of an ordered set of data.
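As a minimal sketch of quantile generation for a single point, the following assumes a nearest-rank definition of a quantile, which is only one of several conventions and is not stated by the embodiments. The function name `quantiles_at` is hypothetical.

```python
import math


def quantiles_at(sample, probs):
    """Nearest-rank quantiles: for each probability q, the smallest
    ordered value whose cumulative fraction is at least q."""
    ordered = sorted(sample)
    n = len(ordered)
    result = []
    for q in probs:
        # Small epsilon guards against floating-point overshoot in q * n.
        rank = max(1, math.ceil(q * n - 1e-9))
        result.append(ordered[rank - 1])
    return result


# Example: quartile boundaries of the simulated statistics 1..100.
simulated = list(range(1, 101))
quartiles = quantiles_at(simulated, [0.25, 0.5, 0.75, 1.0])
print(quartiles)
```

Storing a fixed number of quantiles per point, rather than every simulated statistic, keeps the per-point footprint independent of the number of replications.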

The logic flow 200 involves fitting an estimated CDF curve for each point in the grid of points independently from other points in the grid of points using a number of curve parameters to provide a given level of precision at block 210. For example, the statistic simulator component 122-2 may fit an estimated CDF curve for each point in the grid of points independently from other points in the grid of points using a number of curve parameters to provide a given level of precision. Fitting an estimated CDF curve for each point independently can significantly reduce computational resources needed for curve-fitting operations. For instance, in a simple case, the dimension of the point, p, is only 1; that is to say, p is a real number. Rather than fitting estimated CDF curves for all points in the grid of points simultaneously to build an actual three-dimensional surface, $(p, x, \tilde{T}(p, x))$, the statistic simulator component 122-2 fits an estimated curve, $(x, \tilde{T}(p, x))$, for each point p in sequence or parallel, and then combines the estimated curves to form an approximate three-dimensional surface. Although the approximate three-dimensional surface may have a reduced level of precision relative to the actual three-dimensional surface, curve-fitting operations are greatly accelerated and may consume fewer computational resources. Reducing latency may be of particular importance with larger data sets or multi-dimensional parameter vectors needed for some statistical tests.

The statistic simulator component 122-2 may fit an estimated CDF curve for each point in the grid of points using various types of curve-fitting techniques. For instance, the statistic simulator component 122-2 may utilize, for example, a Gaussian mixture model (EM algorithm), a Bernstein-Polynomials mixture model (EM algorithm), or a monotone cubic spline technique. In one embodiment, the statistic simulator component 122-2 may perform curve-fitting utilizing a monotonic cubic spline interpolation technique with beta transformation, as described in more detail with reference to FIG. 18, infra. Embodiments are not limited to this example.
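One way to picture a monotone cubic curve through a set of quantiles is the Fritsch-Carlson scheme sketched below. This is an illustrative stand-in only: it omits the beta transformation described with reference to FIG. 18, and the names `monotone_slopes` and `make_cdf` as well as the example knots are hypothetical.

```python
import bisect
import math


def monotone_slopes(xs, ys):
    """Fritsch-Carlson tangents for strictly increasing xs, nondecreasing ys."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    delta = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]
    interior = [(delta[i - 1] + delta[i]) / 2 if delta[i - 1] * delta[i] > 0 else 0.0
                for i in range(1, n - 1)]
    m = [delta[0]] + interior + [delta[-1]]
    for i in range(n - 1):
        if delta[i] == 0.0:            # flat segment: force flat tangents
            m[i] = m[i + 1] = 0.0
            continue
        a, b = m[i] / delta[i], m[i + 1] / delta[i]
        s = a * a + b * b
        if s > 9.0:                    # rescale tangents to keep monotonicity
            tau = 3.0 / math.sqrt(s)
            m[i], m[i + 1] = tau * a * delta[i], tau * b * delta[i]
    return m


def make_cdf(xs, ps):
    """Return a monotone cubic Hermite interpolant through (xs, ps)."""
    m = monotone_slopes(xs, ps)

    def cdf(x):
        if x <= xs[0]:
            return ps[0]
        if x >= xs[-1]:
            return ps[-1]
        i = bisect.bisect_right(xs, x) - 1
        h = xs[i + 1] - xs[i]
        t = (x - xs[i]) / h
        h00 = (1 + 2 * t) * (1 - t) ** 2
        h10 = t * (1 - t) ** 2
        h01 = t * t * (3 - 2 * t)
        h11 = t * t * (t - 1)
        return h00 * ps[i] + h10 * h * m[i] + h01 * ps[i + 1] + h11 * h * m[i + 1]

    return cdf


# Hypothetical quantiles (statistic values) and their cumulative probabilities.
knots_x = [0.0, 1.0, 2.0, 4.0, 7.0]
knots_p = [0.0, 0.5, 0.8, 0.95, 1.0]
cdf = make_cdf(knots_x, knots_p)
print(cdf(1.5), cdf(3.0))
```

Because each curve depends only on the quantiles of its own point, curves for different points can be fitted independently, in sequence or in parallel.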

The logic flow 200 may generate a computational representation as source code to interpolate an estimated CDF curve for any point of the statistical test at block 212. For example, the code generator component 122-3 may generate a computational representation 130 as source code to interpolate an estimated CDF curve for any given point of the statistical test 114. In one embodiment, the point may be within the grid of points. In one embodiment, the point may be outside the grid of points. In one embodiment, the point may be entirely disassociated from the grid of points.
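A minimal sketch of how an estimated CDF, and from it a p-value, might be interpolated for an arbitrary point is given below. It assumes a one-dimensional parameter, piecewise-linear CDF tables per grid point, linear blending between the two nearest grid points, and clamping to the nearest grid point for queries outside the grid; none of these choices is stated by the embodiments, and the names `cdf_from_table` and `p_value` are hypothetical.

```python
import bisect


def cdf_from_table(xs, ps, x):
    """Piecewise-linear CDF through (xs, ps); xs strictly increasing."""
    if x <= xs[0]:
        return ps[0]
    if x >= xs[-1]:
        return ps[-1]
    j = bisect.bisect_right(xs, x) - 1
    w = (x - xs[j]) / (xs[j + 1] - xs[j])
    return ps[j] + w * (ps[j + 1] - ps[j])


def p_value(grid, p, x):
    """grid: list of (p_k, xs_k, ps_k) sorted by p_k (assumed layout).
    Returns 1 - CDF at statistic x for parameter p."""
    keys = [g[0] for g in grid]
    if p <= keys[0]:
        lo = hi = 0                      # clamp below the grid (assumption)
    elif p >= keys[-1]:
        lo = hi = len(grid) - 1          # clamp above the grid (assumption)
    else:
        hi = bisect.bisect_right(keys, p)
        lo = hi - 1
    c_lo = cdf_from_table(grid[lo][1], grid[lo][2], x)
    c_hi = cdf_from_table(grid[hi][1], grid[hi][2], x)
    if lo == hi:
        cdf = c_lo
    else:
        w = (p - keys[lo]) / (keys[hi] - keys[lo])
        cdf = c_lo + w * (c_hi - c_lo)
    return 1.0 - cdf


# Two hypothetical grid points with three stored quantiles each.
example_grid = [
    (1.0, [0.0, 1.0, 2.0], [0.0, 0.5, 1.0]),
    (2.0, [0.0, 2.0, 4.0], [0.0, 0.5, 1.0]),
]
print(p_value(example_grid, 1.5, 1.0))
```

A generated computational representation could embed tables of this kind as constants in source code, so that no simulation is needed when a p-value is requested.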

In one embodiment, the computational representation 130 may be generated in a computer programming language, such as C or C++ for example. However, embodiments are not limited to these particular computer programming languages.

The logic flow 200 may reduce a data storage size for the computational representation at block 214. For example, the evaluation component 122-4 may reduce a data storage size for the computational representation 130 through reduction of various components of the computational representation 130, with a corresponding loss in precision. In one embodiment, the data reduction operations may be described in more detail with reference to FIG. 22, infra. Embodiments are not limited to this example.

The logic flow 200 involves controlling task execution of a distributed computing system using a virtual software class at block 216. For example, the simulation control engine 124 of the statistic simulator component 122-2 may control task execution of a distributed computing system using a virtual software class. In addition, a virtual software class may also be used for other operations of the logic flow 200, including without limitation blocks 202, 208, 210, 212 and 214, for example. A virtual software class may be described in more detail with reference to FIG. 5, infra.

FIG. 3 illustrates an example of an operational environment 300. The operational environment 300 may illustrate operation of portions of the automated statistical test system 100, such as the simulated data component 122-1, for example.

As shown in FIG. 3, the simulated data component 122-1 may have a simulated data generator 320. In addition to, or as an alternative to, receiving a simulated data function 110, the simulated data generator 320 may receive a structured input file 310 and a randomizer function 312. The structured input file 310 may have definitions to generate simulated data 330. The randomizer function 312 may generate seeds or random numbers (e.g., a random number generator) for the simulated data 330. The simulated data generator 320 may utilize the simulated data function 110, the structured input file 310, and/or the randomizer function 312 to generate the simulated data 330. The simulated data generator 320 may store the simulated data 330 in a simulation database 340. In one embodiment, for example, the simulated data 330 may be stored in the simulation database 340 in accordance with definitions provided by the structured input file 310.

The structured input file 310 may generally comprise one or more input files with data generation specifications and definitions useful for the simulated data component 122-1 to automatically produce simulated data 330. The specifications and definitions may be in addition to, or in replacement of, specifications and definitions used by the simulated data function 110. The structured input file 310 may utilize any format as long as the input files are structured in a known and well-defined manner. The structured input file 310 provides information about the simulated data 330 and the simulation database 340, among other types of information. For instance, the structured input file 310 may provide information about a computing environment in which the simulation subsystem 120 will run, a database to store the simulated data 330, data structures for the simulated data 330, table space (e.g., table, columns, rows, indices, etc.), the type of simulated data 330 required by each column of output tables in the simulation database 340, how to generate each type of simulated data 330, relationships between columns in a same table and columns in different tables, and other information pertinent to generating simulated data 330.
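As a rough sketch of how column definitions and a seeded random number generator might together drive data generation, consider the following. The dictionary layout standing in for the structured input file 310, the distribution names, and the function `generate_simulated_data` are all hypothetical; the embodiments do not prescribe a file format.

```python
import random

# Hypothetical stand-in for a parsed structured input file.
SPEC = {
    "n_rows": 5,
    "columns": {
        "y": {"dist": "normal", "mean": 0.0, "sd": 1.0},
        "x": {"dist": "uniform", "low": 0.0, "high": 10.0},
    },
}


def generate_simulated_data(spec, seed):
    """Produce one simulated data set as a list of row dictionaries."""
    rng = random.Random(seed)          # stand-in for a randomizer function
    rows = []
    for _ in range(spec["n_rows"]):
        row = {}
        for name, col in spec["columns"].items():
            if col["dist"] == "normal":
                row[name] = rng.gauss(col["mean"], col["sd"])
            elif col["dist"] == "uniform":
                row[name] = rng.uniform(col["low"], col["high"])
            else:
                raise ValueError(f"unknown distribution: {col['dist']}")
        rows.append(row)
    return rows


data_set = generate_simulated_data(SPEC, seed=6000000)
print(len(data_set), sorted(data_set[0]))
```

Supplying the seed separately from the definitions lets the same specification yield many independent, reproducible data sets.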

A particular number of data sets for the simulated data 330 may be dependent, in part, on a particular type of statistical test 114. In one embodiment, for example, assume the statistical test function 112 is designed to implement a multiple structural change (maxF) test. For example, in order to have a 3-digit precision, the simulated data generator 320 may need to generate a sufficient number of data sets to calculate approximately 1,000,000 statistics for each point in a defined grid of points.

FIG. 4 illustrates an example of an operational environment 400. The operational environment 400 may illustrate the operation of portions of the automated statistical test system 100, such as the statistic simulator component 122-2, for example.

As shown in FIG. 4, the statistic simulator component 122-2 may include a simulated statistic generator 420. The simulated statistic generator 420 may receive simulated data 330 generated by the simulated data component 122-1, and use (e.g., call) the statistical test function 112 to generate a set of simulated statistics 430 for a statistical test 114 with the simulated data 330. As with the simulated data 330, the simulated statistics 430 may be stored in the simulation database 340, or a separate database entirely.

The statistic simulator component 122-2 may generate the simulated statistics 430 in different ways using various types of computer systems, including a centralized computing system and a distributed computing system. The statistic simulator component 122-2 may specify and control a particular computer system used for simulation through the simulation control engine 124.

The statistic simulator component 122-2 may generate the simulated statistics using an exemplary procedure, as follows:

  PROC HPSIMULATE
    data=scbpParms /* table containing simulation parameters */
    datadist=(COPYTONODES);
   MODULE name=SCBP
    ext=tkscbp /* TK Extension to plug-in */
    var=(T mmax NQ Q1 Q20 NEPS EPS1 - EPS50) /* variables */
    task=0 /* Task : Simulation */
    taskParmN=(1000000 /* number of replications */
     6000000 /* random seed */ );
   OUTPUT out=scbpSimulation;
   PERFORMANCE nnodes=200 nthreads=6;
  RUN;

The statistic simulator component 122-2 is not limited to this example.

FIG. 5 illustrates an example of an operational environment 500. The operational environment 500 may illustrate operation of portions of the automated statistical test system 100, such as the simulation control engine 124 of the statistic simulator component 122-2, for example.

As shown in FIG. 5, the simulation control engine 124 may include a message interface 520. The message interface 520 may receive the simulated data 330 from the simulated data component 122-1, or retrieve the simulated data 330 from the simulation database 340, and generate a simulation request 530. The simulation request 530 may be a request to generate simulated statistics 430 from the simulated data 330 using the statistical test function 112.

The simulation request 530 may include various types of information about the statistical test 114, as well as information about a computing environment suitable for generating the simulated statistics 430. Examples of computing environment information may include without limitation a name, description, speed requirements, power requirements, operating system requirements, database requirements, computing parameters, communications parameters, security parameters, and so forth. Depending on a particular statistical test 114, the computing environment information may specify a configuration for a computer system having different combinations of computation resources, such as a number of servers, server types, processor circuits, processor cores, processing threads, memory units, memory types, and so forth. For example, the computing environment information may request a single computer with a single processor and a single thread, a single computer with a single processor and multiple threads, a single computer with multiple processors (or processing cores) each with a single thread, a single computer with multiple processors (or processing cores) each with multiple threads, multiple computers each with a single processor and a single thread, multiple computers each with a single processor and multiple threads, multiple computers with multiple processors each with a single thread, multiple computers with multiple processors each with multiple threads, or any combination thereof.

A computing environment for a statistical test simulation may be particularly important when a simulation for a particular statistical test needs a larger set of data, such as in the gigabyte or terabyte range. Enumeration of all possible points could lead to a relatively large grid of points M. Continuing with the previous example of a multiple structural change (maxF) test, in order to have 3-digit precision, the simulated data generator 320 may need to generate a sufficient number of data sets to simulate approximately 1,000,000 statistics for each point in a defined grid of points. Assuming a number of variables is limited to less than 20, a possible number of structural changes is limited to less than 19, and a number of observations is 2,000 to approximate an asymptotic case, a defined grid of points for the maxF test would contain approximately 103,780 points (parameter vectors). To simulate 1,000,000 statistics for each of 103,780 points on a single processor, at roughly 0.001 seconds per statistic, would take approximately 1,200 days. Alternatively, simulating 1,000,000 statistics for each of 103,780 points on 1,200 processors, at roughly 0.001 seconds per statistic, would take approximately 1 day. For a computational task of this size, the message interface 520 may generate a simulation request 530 with computing environment information specifying a need for distributed computations in a distributed computing environment having multiple computers with multiple processors each with multiple threads operating in a parallel processing manner.
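The run-time estimate above may be reproduced with a short calculation. All inputs are taken from the preceding paragraph; perfect parallel scaling across processors is an added simplifying assumption, and the variable names are illustrative.

```python
# Inputs quoted in the text for the maxF example.
points = 103_780                # parameter vectors in the grid
stats_per_point = 1_000_000     # replications for 3-digit precision
seconds_per_stat = 0.001        # rough cost of one simulated statistic
processors = 1_200

total_seconds = points * stats_per_point * seconds_per_stat
days_single = total_seconds / 86_400          # 86,400 seconds per day
days_parallel = days_single / processors      # assumes ideal scaling

print(f"single processor: {days_single:.1f} days")
print(f"{processors} processors: {days_parallel:.2f} days")
```

The single-processor figure comes to roughly 1,201 days, in line with the approximate 1,200 days stated above, and dividing by 1,200 processors gives about one day.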

In one embodiment, the simulation control engine 124 may distribute portions of the simulated data 330 across various parts of a distributed computing environment, and control generation of simulated statistics 430 within the distributed computing environment, through use of one or more software classes 522-v. In object-oriented programming, a software class may be referred to as an extensible template for creating objects, providing initial values for state (e.g., member variables) and implementations of behavior (e.g., member functions, methods). In many computer programming languages, a class name may be used as a name for a class (e.g., the template itself), the name for the default constructor of the class (e.g., a subroutine that creates objects), and as the type of objects generated by instantiating the class. Typically, when an object is created by a constructor of the class, the resulting object may be called an instance of the class, and the member variables specific to the object may be called instance variables, to contrast with the class variables shared across the entire class.

As shown in FIG. 5, the software classes 522 are specifically designed to perform simulations of a statistical test 114 in a distributed computing environment. The software classes 522 may include at least a base software class 522-1 for a statistical test 114 and a virtual software class 522-2 for managing the simulation of a statistical test. In one embodiment, for example, a base software class 522-1 may be implemented as a TK-extension class. In one embodiment, for example, a virtual software class 522-2 may be implemented as a virtual TK-extension class (TKVRT). Embodiments, however, are not limited to these examples.

The base software class 522-1 may include an extensible template to create objects, provide initial values for states, and implementations of behavior for use by a software module to perform a statistical test. The virtual software class 522-2 may include an extensible template to create objects, provide initial values for states, and implementations of behavior for use by the separate software module having a base software class 522-1 for the statistical test, the base software class 522-1 to comprise a child of the virtual software class 522-2. The virtual software class 522-2 may be used to extend the base software class 522-1 when used with a particular computing system, such as a distributed computing system. This allows standard statistical test code using the base software class 522-1 to take advantage of parallel processing algorithms implemented by the distributed computing environment, without having to make modifications to the base software class 522-1. The software classes 522 may be described in more detail with reference to FIGS. 8-11, infra.
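The parent/child relationship described above may be pictured with the following sketch, in which a parent class owns partitioning and task control while a child class supplies only the statistical computation. Local threads stand in for the nodes of a distributed computing system, and the class and method names (`VirtualSimulationClass`, `MeanStatistic`, `analyze`, `summarize`, `run`) are hypothetical rather than the TK-extension interfaces themselves.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class VirtualSimulationClass(ABC):
    """Parent class: owns partitioning, task control and summarizing."""

    def __init__(self, n_workers=4):
        self.n_workers = n_workers

    @abstractmethod
    def analyze(self, partition):
        """Compute statistics for one partition of the simulated data."""

    def summarize(self, partial_results):
        # Flatten per-partition results into a single list.
        return [r for part in partial_results for r in part]

    def run(self, data_sets):
        # Round-robin split of the data sets across workers.
        partitions = [data_sets[i::self.n_workers] for i in range(self.n_workers)]
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            partial = list(pool.map(self.analyze, partitions))
        return self.summarize(partial)


class MeanStatistic(VirtualSimulationClass):
    """Child class: only the statistical computation, no parallel logic."""

    def analyze(self, partition):
        return [sum(ds) / len(ds) for ds in partition]


results = MeanStatistic(n_workers=2).run([[1, 3], [2, 4], [10, 20]])
print(sorted(results))
```

The child class contains no scheduling code, which mirrors the stated goal of letting standard statistical test code benefit from parallel execution without modification.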

FIG. 6 illustrates a diagram for a computing system 600. The computing system 600 may be representative of a computing system suitable for implementing the automated statistical test system 100.

As shown in FIG. 6, the computing system 600 includes a computing environment 606 designed for processing large amounts of data for many different types of applications, such as for scientific, technical or business applications that require a greater number of computer processing cycles. The computing environment 606 may include different types of computing systems, such as a centralized computing system 608 and a distributed computing system 610. Client devices 602-e can interact with the computing environment 606 through a number of ways, such as over a network 604. The network 604 may comprise a public network (e.g., the Internet), a private network (e.g., an intranet), or some combination thereof.

One or more data stores 660 are used to store the data to be processed by the computing environment 606 as well as any intermediate or final data generated by the computing system in non-volatile memory. However, in certain embodiments, the configuration of the computing environment 606 allows its operations to be performed such that intermediate and final data results can be stored solely in volatile memory (e.g., RAM), without a requirement that intermediate or final data results be stored to non-volatile types of memory (e.g., disk).

This can be useful in certain situations, such as when the computing environment 606 receives ad hoc queries from a user and when responses, which are generated by processing large amounts of data, need to be generated on-the-fly (e.g., in real time). In this non-limiting situation, the computing environment 606 is configured to retain the processed information within memory so that responses can be generated for the user at different levels of detail as well as allow a user to interactively query against this information.

A client device 602 may implement portions of the automated statistical test system 100, such as the simulation subsystem 120, for example. When the simulation subsystem 120 executes, and the statistic simulator component 122-2 initiates simulation operations, the simulation control engine 124 of the statistic simulator component 122-2 may generate a simulation request 530 and send the simulation request 530 to the computing environment 606 via the network 604. The computing environment 606 may receive the simulation request 530, and when the simulation request 530 indicates a need for centralized computations, the computing environment 606 may forward the simulation request 530 to the centralized computing system 608 for simulation operations. When the simulation request 530 indicates a need for distributed computations (e.g., parallel processing operations), the computing environment 606 may forward the simulation request 530 to the distributed computing system 610 for simulation operations. The computing systems 608, 610 may be integrated with, or capable of interaction with, a database management system (DBMS) 612 used to control and manage interaction with the data stores 660. The data stores 660 may include, for example, the simulation database 340, as well as other data needed for a given simulation.

FIG. 7 illustrates a diagram of a distributed computing system 610. The distributed computing system 610 may include one or more client devices, such as client device 602, and two or more data processing nodes 702, 704. The nodes 702, 704 may have any of the computer system configurations as described with reference to FIG. 5.

The statistic simulator component 122-2 may simulate statistics with the distributed computing system 610 via the simulation control engine 124. In one embodiment, the distributed computing system 610 may comprise multiple data processing nodes each having multi-core data processors, with at least one of the data processing nodes designated as a control data processing node (“control node”) and multiple data processing nodes designated as worker data processing nodes (“worker nodes”).

The client device 602 may couple to a central process, or control node 702, which, in turn, is coupled to one or more worker nodes 704. In general, each of the nodes of the distributed computing system 610, including the control node 702, and worker nodes 704-1, 704-2, and 704-f, may include a distributed computing engine (DCE) 706 that executes on a data processor associated with that node and interfaces with buffer memory 708 also associated with that node. The DCE 706 may comprise an instance of the simulation control engine 124 of the statistic simulator component 122-2 of the simulation subsystem 120. Each of the nodes may also optionally include an interface to the DBMS 612 and the data stores 660, or local implementations of both (not shown).

In various embodiments, the control node 702 may manage operations in one or more of the worker nodes 704. More particularly, the control node 702 may be arranged to receive and process a simulation request 530 from the client device 602 when distributed computations are to be performed with data stored in one or more of the worker nodes 704.

In various embodiments, one or more of the components of distributed computing system 610 may be collocated, including the client device 602, control node 702, and one or more worker nodes 704. However, more generally, none of the components of distributed computing system 610 need be collocated. Furthermore, in some embodiments, more than one node of the distributed computing system 610 may be arranged to assume the role of the control node. Thus, in some scenarios, the component designated as the control node 702 may assume the role of a worker node, while one of the worker nodes 704-1 to 704-f may assume the role of the control node 702.

In various embodiments, in operation a simulation request 530 may be received by the control node 702 to simulate data and/or statistics for a statistical test, as described previously with respect to FIG. 1. For example, the client device 602 may generate a simulation request 530 to perform a statistical test simulation, which is processed by the control node 702 to construct work requests to be performed by one or more worker nodes 704.

In particular embodiments, a simulation request 530 generated by client device 602 may be received with a name for the distributed computing system 610 to process the simulation request 530. Accordingly, when the distributed computing system 610 is designated, the simulation request 530 is transmitted to control node 702.

Consistent with the present embodiments, when the control node 702 receives a simulation request 530 sent from the client device 602, the control node 702 may unpack the simulation request 530, parse the simulation request 530, and establish a flow of execution steps to perform an operation such as simulating statistics using one or more worker nodes 704 of the distributed computing system 610.

As illustrated in FIG. 7, the distributed computing system 610 may further include a communication protocol such as the message passing interface (MPI) 710. When the control node 702 establishes a flow of execution for a simulation request 530, the control node 702 may distribute the execution steps to worker nodes 704-1 to 704-f via the message passing interface 710. Subsequently, results may be returned from one or more worker nodes 704-1 to 704-f to the control node 702 via the message passing interface 710.

In various embodiments, each of multiple worker nodes 704-1 to 704-f may contain a respective partition of data to be processed according to the compute request. The control node 702 may establish an execution flow in which messages are sent to multiple different worker nodes 704-1 to 704-f. Each worker node 704-1 to 704-f may subsequently load and execute a specified simulation function for the partition of data contained by that worker node.

When each of the worker nodes 704-1 to 704-f that receives a message to execute a simulation function from the control node 702 completes execution of its specified simulation function on its partition of data, the worker node 704 may return results to the control node 702 through the message passing interface 710. The results may subsequently be returned from the control node 702 to the client device 602 that generated the simulation request 530.

Although FIG. 7 illustrates a distributed computing system 610 that comprises a control node 702 and multiple worker nodes 704-f, more general embodiments include any network in which an interface is provided so that a client device may initiate the execution of a compute request within a group of foreign machines, utilize resources of the foreign machines, including memory, input/output functionality, loading of images, launching of threads, and/or utilize a distributed database structure to send and receive message instructions and results.

FIG. 8 illustrates one example of a logic flow 800. The logic flow 800 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the statistic simulator component 122-2 of the simulation subsystem 120 of the automated statistical test system 100.

In the illustrated embodiment shown in FIG. 8, the logic flow 800 may generate simulated data for a statistical test, the statistics of the statistical test based on parameter vectors to follow a probability distribution of a known or unknown form at block 802. For example, the simulated data component 122-1 may generate simulated data 330 for a statistical test 114, the statistical test 114 based on parameter vectors (points) to follow a probability distribution.

The logic flow 800 may simulate statistics for the parameter vectors from the simulated data with a distributed computing system comprising multiple nodes each having one or more processors capable of executing multiple threads, the simulation to occur by distribution of portions of the simulated data across the multiple nodes of the distributed computing system at block 804. For example, the simulated statistic generator 420 of the statistic simulator component 122-2 may simulate statistics for parameter vectors from the simulated data 330, where each parameter vector comprises a single point in a grid of points. The simulation may be performed using a distributed computing system 610 comprising multiple nodes 702, 704, each having one or more processors capable of executing multiple threads. The simulation may occur by distribution of portions of the simulated data 330 across the multiple nodes 702, 704 of the distributed computing system 610.

The logic flow 800 may control task execution on the distributed portions of the simulated data on each node of the distributed computing system with a virtual software class arranged to coordinate task and sub-task operations across the nodes of the distributed computing system at block 806. For example, the simulation control engine 124 of the statistic simulator component 122-2 may control task execution to simulate statistics 430 from the distributed portions of the simulated data 330 on each node 702, 704 of the distributed computing system 610 with a virtual software class 522-2 arranged to assist in coordinating task and sub-task operations across the nodes 702, 704 of the distributed computing system 610.

FIG. 9 illustrates one example of a logic flow 900. The logic flow 900 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation control engine 124 of the statistic simulator component 122-2 of the simulation subsystem 120 of the automated statistical test system 100, on the distributed computing system 610. More particularly, logic flow 900 illustrates the simulation control engine 124 creating an instance of a virtual software class 522-2 on one or more nodes of the distributed computing system 610.

In some cases, simulation tasks may be implemented by multiple nodes 702, 704 arranged in a soloist architecture or a general/captain architecture. In a soloist architecture, simulations may be performed by a centralized computing system 608. In a general/captain architecture, simulations may be performed by a distributed computing system 610, where a control node 702 is designated as a general node, and one or more worker nodes 704 may be designated as captain nodes.

As shown in FIG. 9, the logic flow 900 may perform initializing and parsing operations at block 902. A call to an instance of software class tksimDoAnalysis may be made to initiate task analysis at block 904. A subroutine named DoAnalysis(.) to perform the task analysis may be executed at block 906. Control is passed at point A.

When in a general/captain mode, control is passed at point B to the general node, and a subroutine for task initialization may be executed at block 910. At general start, a subroutine named ManageInformation(.): Message Loop may be executed at block 912. A test whether the task is analysis is performed at diamond 914. If the test is not passed, various clean up procedures are called and general processing terminates. If the test is passed, subroutines TaskManager(.), Zathread(.), Launcher(.) and DoAnalysis(.) are executed in a recursive manner at block 916. Control is passed at point C. Control is returned to the general node at point D.

The ManageInformation(.): Message Loop executed at block 912 may broadcast instructions to one or more captain nodes. The captain nodes perform operations similar to the general node for portions of the simulation. For instance, at captain start, a subroutine named ManageInformation(.): Message Loop may be executed at diamond 922. A test whether the task is analysis is performed at diamond 922. If the test is not passed, various clean up procedures are called and captain processing terminates. If the test is passed, subroutines TaskManager(.), Zathread(.), Launcher(.) and DoAnalysis(.) are executed in a recursive manner at block 924. Control is passed at point E. Control is returned to the captain node at point F.

FIG. 10 illustrates one example of a logic flow 1000. The logic flow 1000 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation control engine 124 of the statistical test component 122-2 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 1000 interoperates with the logic flow 900 at the various control locations A-F.

As shown in the logic flow 1000, when control is passed at control location A from the logic flow 900, a determination is made as to whether task analysis is to be performed in a soloist architecture or a general/captain architecture at diamond 1032. If a soloist architecture, then subroutines CreateParentTKVRTInstance(.) and tkvrtGridInitialize(.) are executed at block 1036. A loop starts to execute subroutines ExecuteTheThreads(str, TASK_ANALYSIS) and tkvrtGridSummarize(.) at block 1038. Control is passed at point A. If not a soloist architecture, then a determination is made as to whether task analysis is to be performed in a general/captain architecture at diamond 1034. If a general/captain architecture, then control is passed at control location B to the logic flow 900.

When control is passed at control location C from the logic flow 900, the general node may execute a subroutine GridTask(str, TASK_ANALYSIS) at block 1040, a subroutine MPI_Bcast(TASK_ANALYSIS) at block 1042, and subroutines CreateParentTKVRTInstance(.) and tkvrtGridInitialize(.) at block 1044. A loop starts to execute subroutines ExecuteTheThreads(str, TASK_ANALYSIS) and tkvrtGridSummarize(.) at block 1046. Once the loop completes, the general node executes a subroutine MPI_Bcast(TASK_LOCALSTOP, . . . ) at block 1048. Parameters TASK_ANALYSIS and/or TASK_LOCALSTOP are passed to the block 1050, and control is passed at control location D to the logic flow 900.

Certain subroutines executed by the general node are designed to interoperate with subroutines executed by the captain node to coordinate completion of tasks and sub-tasks. For instance, when the general node executes subroutines CreateParentTKVRTInstance(.) and tkvrtGridInitialize(.) at block 1044, and the loop at block 1046, messages and parameters may be exchanged in similar subroutines executed by the captain node at corresponding blocks 1056, 1058, respectively, to coordinate task and sub-task completion. Such communication between the general node and captain nodes may be necessary for some complex algorithms; however, for algorithms in which the tasks and sub-tasks are independent, no such communication is needed and execution cost is saved.

When control is passed at control location E from the logic flow 900, the captain node may start a loop to execute subroutines GridTask(str, TASK_UNKNOWN) and MPI_Bcast(task, . . . ) at block 1050. A determination is made as to whether analysis is complete at diamond 1052 using the TASK_ANALYSIS parameter. If the TASK_ANALYSIS parameter is evaluated as TRUE, the subroutines at blocks 1056, 1058 are executed, and control is passed back to block 1050. If the TASK_ANALYSIS parameter is evaluated as FALSE, a determination is made as to whether a local stop has occurred at diamond 1054 using the TASK_LOCALSTOP parameter. If the TASK_LOCALSTOP parameter is evaluated as TRUE, control is passed at control location F. If the TASK_LOCALSTOP parameter is evaluated as FALSE, control is passed back to block 1050.
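By way of a non-limiting sketch, the general/captain exchange described above may be modeled with in-process queues standing in for the broadcast calls. The Python names captain_loop and general_run, the queue-based transport, and the placeholder work function are assumptions made for illustration only and are not part of the described embodiments.

```python
import queue
import threading

# Task codes named after the parameters in the text; the values are illustrative.
TASK_ANALYSIS = "TASK_ANALYSIS"
TASK_LOCALSTOP = "TASK_LOCALSTOP"


def captain_loop(inbox, work, results):
    """Captain side: wait for a broadcast task, act on it, then wait again."""
    while True:
        task = inbox.get()              # stands in for MPI_Bcast(task, ...)
        if task == TASK_ANALYSIS:
            results.append(work())      # analysis step (cf. blocks 1056, 1058)
        elif task == TASK_LOCALSTOP:
            return                      # leave the loop (cf. control location F)
        # Any other task code: fall through and wait for the next message.


def general_run(n_captains, n_rounds):
    """General side: broadcast analysis tasks, then broadcast a local stop."""
    inboxes = [queue.Queue() for _ in range(n_captains)]
    results = [[] for _ in range(n_captains)]
    threads = [
        threading.Thread(
            target=captain_loop,
            # Placeholder work: captain i simply returns i * 10.
            args=(inboxes[i], lambda i=i: i * 10, results[i]),
        )
        for i in range(n_captains)
    ]
    for t in threads:
        t.start()
    for _ in range(n_rounds):
        for box in inboxes:
            box.put(TASK_ANALYSIS)
    for box in inboxes:
        box.put(TASK_LOCALSTOP)
    for t in threads:
        t.join()
    return results
```

In this sketch each captain processes its messages in arrival order, so a call such as general_run(3, 2) yields two results per captain before every captain stops.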

FIG. 11 illustrates one example of a logic flow 1100, which shows how to finish the tasks and sub-tasks in parallel in the multithreaded environment. The logic flow 1100 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation control engine 124 of the statistical test component 122-2 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 1100 illustrates certain operations for subroutines executed at blocks 1038, 1046 and 1058 of the logic flow 1000.

As shown in the logic flow 1100, when the subroutine ExecuteTheThreads(.) is executed at blocks 1038, 1046 and 1058 of the logic flow 1000, thread execution 1170 executes subroutines InitializeParentThread(.) and tkvrtInitialize(parentInst) at block 1172. The thread execution 1170 then starts a loop for all child threads to execute subroutines InitializeChildThreads(.) and tkvrtInitialize(childInst) at block 1174. The thread execution 1170 then starts an event loop to execute subroutines InitializeChildThreads(.) and tkvrtInitialize(childInst) at block 1176. The thread execution 1170 then executes subroutines AccumulateChildThreads(.) and tkvrtSummarize(parentInst) at block 1178.

In one embodiment, the simulation control engine 124 may control thread execution 1170 for each node 702, 704 of the distributed computing system 610 with various instances of a virtual software class 522-2. The virtual software class 522-2 may be arranged to control task operations across the nodes 702, 704 of the distributed computing system 610 while reducing dependency between tasks and sub-tasks. The logic flow 1100 illustrates an example for a virtual software class 522-2 called TKVRT extension 1180.

In various embodiments, the simulation control engine 124 may pass or receive one or more virtual software class parameters for each instance of a virtual software class, the one or more parameters comprising at least one of input/output parameters, input/output tables, or a pointer to list all instances of virtual software class parameters. For instance, with respect to TKVRT extension 1180, the simulation control engine 124 may pass or receive one or more virtual software class parameters for each instance of TKVRT, including tkvrtParmsPtr, input/output parameters, input/output tables, and a pointer to list all instances of tkvaParmPtrs. The TKVRT extension 1180 may also include several subroutines as used in logic flows 900, 1000.

In one embodiment, the simulation control engine 124 may initialize a parent thread with parent parameters with a first instance of the virtual software class TKVRT extension 1180, which includes tkvrtInitialize(parentInst) as shown in block 1184.

In one embodiment, the simulation control engine 124 may initialize a child thread with child parameters with a first instance of the virtual software class TKVRT extension 1180, which includes tkvrtInitialize(childInst) as also shown in block 1184.

In one embodiment, the simulation control engine 124 may analyze work results of a child thread with a second instance of the virtual software class TKVRT extension 1180, which includes tkvrtAnalyze(childInst) as shown in block 1186.

In one embodiment, the simulation control engine 124 may summarize work results of a child thread to a parent thread with a third instance of the virtual software class TKVRT extension 1180, which includes tkvrtSummarize(parentInst) as shown in block 1188.

In one embodiment, the simulation control engine 124 may initialize a grid with parent parameters with a fourth instance of the virtual software class TKVRT extension 1180, which includes tkvrtGridInitialize(parentInst) as shown in block 1190.

In one embodiment, the simulation control engine 124 may summarize a grid with parent parameters with a fifth instance of the virtual software class TKVRT extension 1180, which includes tkvrtGridSummarize(parentInst) as shown in block 1192.

It may be appreciated that these are merely a few example subroutines for the TKVRT extension 1180, and others exist as well. Embodiments are not limited in this context.
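As one illustrative and non-limiting sketch, the initialize, analyze and summarize steps listed above may be arranged around a thread pool as follows. The function execute_the_threads, the dictionary-based instances, and the summation used as the summary are hypothetical stand-ins chosen for illustration; they are not the TKVRT interface itself.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_the_threads(parent_params, n_children, analyze):
    """Initialize a parent, run child work in parallel, then summarize.

    `analyze` is a caller-supplied function (child_id, params) -> number.
    """
    # Initialize the parent instance (cf. tkvrtInitialize(parentInst)).
    parent = {"params": parent_params, "summary": None}

    # Initialize one child instance per thread (cf. tkvrtInitialize(childInst)).
    children = [
        {"id": i, "params": parent_params, "result": None}
        for i in range(n_children)
    ]

    def run(child):
        # Each child analyzes its own share of work (cf. tkvrtAnalyze(childInst)).
        child["result"] = analyze(child["id"], child["params"])
        return child

    with ThreadPoolExecutor(max_workers=n_children) as pool:
        children = list(pool.map(run, children))

    # Accumulate child results into the parent (cf. tkvrtSummarize(parentInst)).
    parent["summary"] = sum(child["result"] for child in children)
    return parent
```

Because each child writes only to its own instance, no locking is needed until the final accumulation, which mirrors the reduced dependency between tasks and sub-tasks noted above.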

FIG. 12 illustrates one example of a logic flow 1200. The logic flow 1200 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation control engine 124 of the statistical test component 122-2 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 1200 illustrates distribution algorithms for use with the distributed computing system 610.

As shown in FIG. 12, the logic flow 1200 may generate simulated data for a statistical test, the statistics of the statistical test based on parameter vectors to follow a probability distribution at block 1202. For example, the simulated data component 122-1 may generate simulated data 330 for a statistical test 114, the statistics of the statistical test 114 based on parameter vectors to follow a probability distribution of a known or unknown form.

The logic flow 1200 may simulate statistics for the parameter vectors from the simulated data, each parameter vector to comprise a single point in a grid of points, with a distributed computing system comprising multiple nodes each having one or more processors capable of executing multiple threads, the simulation to occur through distribution of portions of the simulated data or simulated statistics across the multiple nodes of the distributed computing system in accordance with a column-wise or column-wise-by-group distribution algorithm at block 1204. For example, the simulated statistic generator 420 of the statistic simulator component 122-2 may simulate statistics for the parameter vectors from the simulated data 330. Each parameter vector for the statistical test 114 may comprise a single point in a grid of points, with the grid of points to be used for interpolation. The simulation may be performed with a distributed computing system 610 comprising multiple nodes 702, 704. Each node 702, 704 may have one or more processors capable of executing multiple threads. The simulation control engine 124 of the statistic simulator component 122-2 may control simulation of the statistical test 114 by distributing portions of the simulated data 330 and/or simulated statistics 430 across the multiple nodes 702, 704 of the distributed computing system 610 in accordance with a column-wise or column-wise-by-group distribution algorithm. A column-wise or column-wise-by-group distribution algorithm may be described in more detail with reference to FIGS. 13-17, infra.

The logic flow 1200 may create a computational representation arranged to generate an approximate probability distribution for each point in the grid of points from the simulated statistics, the approximate probability distribution to comprise an empirical cumulative distribution function (CDF) at block 1206. For example, the code generator component 122-3 may create a computational representation 130, such as a DLL file. The computational representation 130 may be arranged to generate an approximate probability distribution 132 for each point in the grid of points from the simulated statistics 430. The approximate probability distribution 132 may comprise an empirical CDF, for example.

FIG. 13 illustrates an example of a simulated data structure 1300. The simulated data structure 1300 may be a software data structure arranged to store simulated data 330 and/or simulated statistics 430 in the simulation database 340.

The statistic simulator component 122-2 may generate the simulated data structure 1300. In one embodiment, the statistic simulator component 122-2 may generate the simulated data structure 1300 as a table. The simulated data structure 1300 may include an ordered arrangement of rows 1302-g and columns 1304-h to form multiple cells 1306-i. A cell 1306 may contain a simulation of a simulated statistic 430 (or simulated data 330) for a point in the grid of points, where each row 1302 represents a simulation of the simulated statistic 430 (or simulated data 330), and each column 1304 represents a point in the grid of points.

When populated, the simulated data structure 1300 may have a defined data storage size for a given statistical test 114. For instance, with the maxF test, the simulated data structure 1300 may comprise 1,000,000 rows and 103,780 columns, which gives the simulated data structure 1300 a data storage size of approximately 800 Gigabytes (GB).

FIG. 14 illustrates an example of an operational environment 1400. The operational environment 1400 shows distributing portions of the simulated data structure 1300 as column-based work units for the distributed computing system 610.

The simulation control engine 124 of the statistic simulator component 122-2 may control simulation of the statistical test 114 by distributing portions of the simulated data structure 1300 across the multiple nodes 702, 704 of the distributed computing system 610 in accordance with a column-wise distribution algorithm. For instance, the simulation control engine 124 may distribute the simulated data structure 1300 by column across multiple worker nodes 704 of the distributed computing system 610.

The DCE 706 of the control node 702 may distribute one or more columns 1304-h of the simulated data structure 1300 to one or more worker nodes 704 via the message passing interface 710. As shown in FIG. 14, the DCE 706 may distribute columns 1304-1, 1304-2 . . . 1304-h of the simulated data structure 1300 as work units to the worker nodes 704-1, 704-2 . . . 704-f, respectively. A worker node 704 may process its assigned work unit, such as sorting each column 1304 and/or calculating quantiles for the statistical test 114. The worker nodes 704 may pass their processed work units, or pointers to the processed work units, to the DCE 706 via the message passing interface 710. The DCE 706 may reassemble the processed work units into an output file to form a new version of the simulated data structure 1300.

In one embodiment, the new version of the simulated data structure 1300 may include an ordered arrangement of rows and columns, each row to represent a point in the grid of points and each column to represent a quantile for each point in the grid of points. In the case where the worker nodes 704 are tasked to calculate quantiles for the statistical test 114, the worker nodes 704 may pass back a defined number of quantiles as established for the statistical test 114. For instance, with the maxF test, the original simulated data structure 1300 may comprise 1,000,000 rows and 103,780 columns, which gives the original simulated data structure 1300 a data storage size of approximately 800 Gigabytes (GB). Assume the worker nodes 704 are to calculate 10,001 quantiles for the maxF test. In this case, the new simulated data structure 1300 may comprise 10,001 columns and 103,780 rows, which gives the new simulated data structure 1300 a reduced data storage size of approximately 8 GB.
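A minimal, non-limiting sketch of the column-wise reduction described above is given below. Each column is treated as one work unit, sorted, and reduced to a fixed number of quantiles, so that the output has one row per grid point. The function names and the evenly spaced order-statistic rule used to pick quantiles are assumptions for illustration; the text does not specify the quantile definition, and the distribution across worker nodes is replaced here by a simple loop.

```python
def column_quantiles(column, n_quantiles):
    """Sort one column and return n_quantiles evenly spaced order statistics.

    The evenly spaced selection rule is an illustrative assumption.
    """
    ordered = sorted(column)
    last = len(ordered) - 1
    return [
        ordered[round(k * last / (n_quantiles - 1))]
        for k in range(n_quantiles)
    ]


def columnwise_quantile_table(table, n_quantiles):
    """Reduce a (simulations x grid points) table to (grid points x quantiles)."""
    n_cols = len(table[0])
    output = []
    for j in range(n_cols):
        # One column is one work unit; a worker node would receive just this.
        column = [row[j] for row in table]
        output.append(column_quantiles(column, n_quantiles))
    return output
```

For a table with five simulated rows and two grid points, requesting three quantiles returns two rows of three values each, which is the transposed, reduced layout described for the new version of the simulated data structure 1300.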

In one embodiment, the statistic simulator component 122-2 may generate quantiles using the distributed computing system 610 in accordance with an exemplary procedure, as follows:

  PROC HPSIMULATE
    data=scbpSimulation /* output of simulation with group head */
    datadist=(COLUMNWISEBY);
  MODULE name=SCBP
    ext=tkscbp /* TK Extension to plug-in */
    var=(c:) /* all columns */
    task=1 /* Task : Post-processing */;
  OUTPUT out=scbpQuantiles;
  PERFORMANCE nnodes=200 nthreads=6;
  RUN;

Embodiments are not limited to this example.

FIG. 15 illustrates an example of a simulated data structure 1500. The simulated data structure 1500 may be a software data structure arranged to store simulated data 330 and/or simulated statistics 430 in the simulation database 340.

The statistic simulator component 122-2 may generate the simulated data structure 1500. In one embodiment, the statistic simulator component 122-2 may generate the simulated data structure 1500 as a table. The simulated data structure 1500 may include an ordered arrangement of rows 1502-j and columns 1504-k to form multiple cells 1506-m. A cell 1506 may contain a simulation of a simulated statistic 430 (or simulated data 330) for a point in the grid of points, where each row 1502 represents a simulation of the simulated statistic 430 (or simulated data 330), and each column 1504 represents a point in the grid of points. Additionally, the simulated data structure 1500 may be organized into column groups 1508-n. For instance, a first column group 1508-1 may include six columns for parameter vector 4, and a second column group 1508-2 may include five columns for parameter vector 5, and so forth.

As with simulated data structure 1300, the simulated data structure 1500 may have a defined data storage size for a given statistical test 114. For instance, with the maxF test, the simulated data structure 1500 may comprise 1,000,000 rows and 103,780 columns, which gives the simulated data structure 1500 a data storage size of approximately 800 Gigabytes (GB).

FIG. 16 illustrates an example of an operational environment 1600. The operational environment 1600 shows distributing portions of the simulated data structure 1500 as column-group-based work units for the distributed computing system 610.

The simulation control engine 124 of the statistic simulator component 122-2 may control simulation of the statistical test 114 by distributing portions of the simulated data structure 1500 across the multiple nodes 702, 704 of the distributed computing system 610 in accordance with a column-wise-by-group distribution algorithm. For instance, the simulation control engine 124 may distribute the simulated data structure 1500 by groups of columns (or column groups) across multiple worker nodes 704 of the distributed computing system 610. Distributing the simulated data structure 1500 by column group may make it easier to calculate the simulated statistic 430 for each point in the grid of points relative to the column-wise distribution algorithm.

The simulation control engine 124 may perform column group distribution according to column groups 1508-n defined in a control row of the simulated data structure 1500. The control row may include various identifiers or parameters to control distribution. In one embodiment, for example, the control row may include a group identifier to identify corresponding columns in a group, a restriction identifier to identify corresponding columns that do not need to be distributed, and a universal identifier to identify corresponding columns that need to be distributed across all worker nodes. It may be appreciated that other identifiers and parameters may be used as desired for a given implementation. Embodiments are not limited in this context.

The DCE 706 of the control node 702 may distribute one or more column groups 1508-n of the simulated data structure 1500 to one or more worker nodes 704 via the message passing interface 710. As shown in FIG. 16, the DCE 706 may distribute column groups 1508-1, 1508-2 . . . 1508-n of the simulated data structure 1500 as work units to the worker nodes 704-1, 704-2 . . . 704-f, respectively. A worker node 704 may process its assigned work unit, such as calculating the statistics for the statistical test 114, based on the column groups, and then calculating quantiles for the statistical test 114. The worker nodes 704 may pass their processed work units, or pointers to the processed work units, to the DCE 706 via the message passing interface 710. The DCE 706 may reassemble the processed work units into an output file to form a new version of the simulated data structure 1500.
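A non-limiting sketch of grouping columns by a control row, and of computing one statistic per simulated row from a column group, is given below. The control row is modeled as a plain list of group identifiers, and the row-wise maximum is used only as an illustrative stand-in for the statistic of the statistical test 114. The restriction and universal identifiers are not modeled, and all function names are hypothetical.

```python
from collections import defaultdict


def split_by_control_row(control_row):
    """Map each group identifier to the column indices that carry it."""
    groups = defaultdict(list)
    for j, group_id in enumerate(control_row):
        groups[group_id].append(j)
    return dict(groups)


def group_statistics(table, columns, statistic=max):
    """Compute one statistic per simulated row from the columns of one group.

    `statistic` defaults to max purely for illustration.
    """
    return [statistic(row[j] for j in columns) for row in table]
```

Each entry returned by split_by_control_row corresponds to one work unit, so a worker receiving the columns of a group has everything it needs to compute the per-row statistic locally before calculating quantiles.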

In one embodiment, the new version of the simulated data structure 1500 may include an ordered arrangement of rows and columns, each row to represent a point in the grid of points and each column to represent a quantile for each point in the grid of points. In the case where the worker nodes 704 calculate quantiles for the statistical test 114, as with the simulated data structure 1300, the worker nodes 704 may pass back a defined number of quantiles as established for the statistical test 114. For instance, with the WDmaxF test, the original simulated data structure 1500 may comprise 1,000,000 rows and 103,780 columns of maxF test statistics, which gives the original simulated data structure 1500 a data storage size of approximately 800 Gigabytes (GB). Assume the worker nodes 704 are to calculate 10,001 quantiles for the WDmaxF test. In this case, the new simulated data structure 1500 may comprise 10,001 columns and 103,780 rows, which gives the new simulated data structure 1500 a reduced data storage size of approximately 8 GB.

FIG. 17 illustrates an example of a simulated data structure 1700. The simulated data structure 1700 may illustrate an example of the new versions of the simulated data structures 1300, 1500. As described with reference to FIGS. 13-16, new versions of the simulated data structures 1300, 1500 may each include an ordered arrangement of rows 1702-p and columns 1704-q, each row 1702 to represent a point in the grid of points and each column 1704 to represent a quantile of the grid of points. Simulated data structure 1700 is transposed relative to the simulated data structures 1300, 1500, in that the simulated data structures 1300, 1500 have columns representing points in a grid of points, while the simulated data structure 1700 has columns representing quantiles.

FIG. 18 illustrates one example of a logic flow 1800. The logic flow 1800 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the statistic simulator component 122-2 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 1800 illustrates curve fitting algorithms for use with a grid of points.

As shown in FIG. 18, the logic flow 1800 may generate simulated data for a statistical test, statistics of the statistical test based on parameter vectors to follow a probability distribution at block 1802. For example, the simulated data component 122-1 may generate simulated data 330 for a statistical test 114, the statistical test 114 based on parameter vectors to follow a probability distribution of known or unknown form. Alternatively, the simulated data component 122-1 may receive simulated data 330 for a statistical test 114 from an external source.

The logic flow 1800 may simulate statistics for the parameter vectors from the simulated data, each parameter vector to comprise a single point in a grid of points at block 1804. For instance, the statistic simulator component 122-2 may generate simulated statistics 430 for the parameter vectors from the simulated data 330, each parameter vector to comprise a single point in a grid of points.

The logic flow 1800 may calculate quantiles for the parameter vectors from the simulated data at block 1806. For instance, the statistic simulator component 122-2 may calculate quantiles saved in the simulated data structure 1700 for the parameter vectors from the simulated data 330.

The logic flow 1800 may fit an estimated CDF curve to quantiles for each point in the grid of points using a monotonic cubic spline interpolation technique in combination with a transform to satisfy a defined level of precision at block 1808. For instance, the statistic simulator component 122-2 may construct an estimated CDF curve for each point in the grid of points using a monotonic cubic spline interpolation technique in combination with a transform to interpolate quantiles in the simulated data structure 1700 in order to satisfy a precision level of interest.

Once the simulation control engine 124 generates the simulated data structure 1700 with quantiles for the statistical test 114, the statistic simulator component 122-2 may use the quantiles to fit an estimated CDF curve for each point in the grid of points. The statistic simulator component 122-2 may fit an estimated CDF for each point according to a given level of precision. In general, reducing a level of precision results in a corresponding reduction in a number of curve parameters needed to fit the estimated CDF curve.

As previously described with reference to FIG. 2, the statistic simulator component 122-2 may simulate statistics for all given parameter vectors (p) for a statistical test (T) from the simulated data 330. In accordance with Equation (1), the empirical CDF $\tilde{T}(p, x)$ may have a precision of approximately $1/\sqrt{N}$, where N is the sample size, or the number of simulated statistics, for the given parameter vector p. For example, when N=1,000,000, the precision is about 0.001. However, the statistic simulator component 122-2 may generate an estimated CDF curve with far fewer curve parameters than N.
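For illustration only, the empirical CDF and the stated rule of thumb for its precision may be sketched as follows. The function names are hypothetical, and the empirical CDF is taken here in its usual form, the fraction of simulated statistics not exceeding x, which is an assumption since Equation (1) is not reproduced in this section.

```python
import bisect
import math


def empirical_cdf(sorted_stats, x):
    """Fraction of simulated statistics less than or equal to x.

    `sorted_stats` must already be sorted in ascending order.
    """
    return bisect.bisect_right(sorted_stats, x) / len(sorted_stats)


def approx_precision(n):
    """Rule-of-thumb precision of an empirical CDF built from n simulations."""
    return 1.0 / math.sqrt(n)
```

With N=1,000,000 simulated statistics, approx_precision returns 0.001, matching the example above.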

The statistic simulator component 122-2 may select a number of curve parameters to fit an estimated CDF curve for each point in the grid of points to provide a given level of precision. For instance, assume that a precision level is set as 0.0005, and that a monotonic cubic spline interpolation technique is used to fit the curve. On average, approximately 20 curve parameters can achieve a curve C(c(p), . . . ) as set forth in Equation (2), as follows:

$\begin{matrix}{\max\limits_{x}\left| {C\left( {c(p)},x \right)} - {\tilde{T}\left( p,x \right)} \right| \leq 0.0005} & {{Equation}\mspace{14mu}(2)}\end{matrix}$

where c(p) denotes the point-dependent curve parameters.
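One way to realize a monotonic cubic interpolant and to check a bound of the form of Equation (2) is sketched below. The sketch uses Fritsch-Carlson tangent limiting for a cubic Hermite spline, which is a well-known monotone scheme chosen here for illustration; the text does not state which monotonic cubic spline variant is used, and the transform discussed in this section is not included. The names monotone_cubic and max_deviation are hypothetical.

```python
import bisect
import math


def monotone_cubic(xs, ys):
    """Build a monotone cubic Hermite interpolant through knots (xs, ys).

    xs must be strictly increasing and ys monotone; tangents are limited
    with the Fritsch-Carlson rule so the curve cannot overshoot.
    """
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    delta = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]

    # Initial tangents: one-sided at the ends, averaged secants inside.
    m = [delta[0]] + [0.0] * (n - 2) + [delta[-1]]
    for i in range(1, n - 1):
        if delta[i - 1] * delta[i] > 0:
            m[i] = 0.5 * (delta[i - 1] + delta[i])

    # Scale tangents back wherever they would break monotonicity.
    for i in range(n - 1):
        if delta[i] == 0:
            m[i] = m[i + 1] = 0.0
            continue
        a, b = m[i] / delta[i], m[i + 1] / delta[i]
        s = a * a + b * b
        if s > 9.0:
            tau = 3.0 / math.sqrt(s)
            m[i], m[i + 1] = tau * a * delta[i], tau * b * delta[i]

    def evaluate(x):
        # Locate the interval containing x, clamped to the knot range.
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 2)
        t = (x - xs[i]) / h[i]
        t2, t3 = t * t, t * t * t
        return ((2 * t3 - 3 * t2 + 1) * ys[i]
                + (t3 - 2 * t2 + t) * h[i] * m[i]
                + (-2 * t3 + 3 * t2) * ys[i + 1]
                + (t3 - t2) * h[i] * m[i + 1])

    return evaluate


def max_deviation(curve, reference, grid):
    """Largest absolute gap between two curves over a grid of x values."""
    return max(abs(curve(x) - reference(x)) for x in grid)
```

In use, the knots would play the role of the curve parameters c(p), the reference would be the empirical CDF for the point p, and knots could be added until max_deviation falls at or below the chosen precision, such as 0.0005.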

In some cases, however, a number of curve parameters may be reduced through combination of a monotonic cubic spline interpolation technique and a transform. In one embodiment, for example, the statistic simulator component 122-2 may combine a monotonic cubic spline interpolation technique with a beta transformation. A beta transformation is a transform performed in accordance with a normalized incomplete beta function, the normalized incomplete beta function comprising a nonnegative function whose derivative is completely positive. In one embodiment, a beta function may comprise a CDF of a beta distribution. A beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parameterized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.

Assume the monotonic cubic spline interpolation technique fits a first estimated CDF curve with a first number of knots to give a first level of precision (0.0005), each knot comprising an x value and a y value for a two-dimensional coordinate system. The monotonic cubic spline interpolation technique spaces the x values at regular intervals along the x-axis as it is monotonic. As such, more knots are needed to accurately fit the curve. The monotonic cubic spline interpolation technique may be combined with a beta transformation to transform the x values to reduce the first number of knots to a second number of knots that gives approximately the first level of precision (0.0005), where the second number of knots is lower than the first number of knots. Applying the beta transformation causes the x values to be placed at irregular intervals, which reduces the number of knots.

Combining a monotonic cubic spline interpolation technique with a transform, such as the beta transformation, results in fewer curve parameters needed for a same or similar level of precision. For instance, in the previous example, the use of the monotonic cubic spline interpolation technique reduced a number of curve parameters from 1,000,000 simulated statistics to approximately 20 curve parameters. By combining the monotonic cubic spline interpolation technique with a beta transformation, the number of curve parameters may be further reduced from 20 curve parameters to 12 curve parameters, for a same or similar level of precision (e.g., 0.0005).

Once a number of curve parameters is selected, the statistic simulator component 122-2 may fit an estimated CDF curve for each point in the grid of points independently from other points in the grid of points using the selected number of curve parameters to provide a given level of precision. Fitting an estimated CDF curve for each point independently significantly reduces computational resources needed for curve-fitting operations. For instance, in the simple case where the point is one-dimensional, rather than fitting estimated CDF curves for all points in the grid of points simultaneously to build an actual three-dimensional surface, the statistic simulator component 122-2 fits an estimated curve for each point in sequence or in parallel, and then combines the estimated curves to form an approximate three-dimensional surface.

Once curve-fitting operations are finished, the statistic simulator component 122-2 may generate a simulated data structure with information for a set of fitted CDF curves for the grid of points. Continuing with the maxF test example, the simulated data structure may have a data storage size calculated as 8 GB / 10,001 × 12 ≈ 10 megabytes (MB). As indicated with the maxF test example, a data storage size for each version of a simulated data structure reduces from 800 GB to 8 GB to 10 MB. This results in a significantly smaller data storage size needed for the computational representation 130.
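The storage figures in the maxF example can be reproduced with simple arithmetic, as sketched below. The sketch assumes each stored value occupies 8 bytes (a double-precision number); that width is an assumption which happens to match the quoted sizes and is not stated in the text.

```python
BYTES_PER_VALUE = 8  # assumption: one double-precision value per cell


def table_bytes(rows, cols):
    """Storage needed for a dense rows x cols table of 8-byte values."""
    return rows * cols * BYTES_PER_VALUE


GRID_POINTS = 103_780

raw_bytes = table_bytes(1_000_000, GRID_POINTS)    # simulated statistics
quantile_bytes = table_bytes(GRID_POINTS, 10_001)  # quantiles per grid point
fitted_bytes = table_bytes(GRID_POINTS, 12)        # curve parameters per point

for label, size in [("raw", raw_bytes),
                    ("quantiles", quantile_bytes),
                    ("fitted", fitted_bytes)]:
    print(f"{label}: {size / 1e9:.3f} GB")
```

Under this assumption the three stages come to roughly 830 GB, 8.3 GB and 0.01 GB (about 10 MB), consistent with the approximate 800 GB, 8 GB and 10 MB figures above.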

In one embodiment, the statistic simulator component 122-2 may perform curve-fitting operations in accordance with the following exemplary procedure:

  PROC HPSIMULATE
    data=scbpQuantiles /* output of quantiles */
    datadist=(ROUNDROBIN);
  MODULE name=fitcdf
    ext=tkdens /* TK Extension to plug-in */
    var=(key1 - key3 q0 - q10000) /* keys and quantiles */
    task=0 /* Task : Fit CDF curves */
    taskParmN=( /*nKeys=*/3 /*maxParm=*/32 /*maxIter=*/10000
      /*precision=*/0.0005 /*maxModels=*/1 /*weightTails=*/0
      /*weightA=*/-4.605 /*weightB=*/5.685 /*transType=*/1
      /*transGridL=*/-2.0 /*transGridU=*/2.0 /*transGridS=*/0.1 );
  OUTPUT out=scbpFitCDFCurves;
  PERFORMANCE nnodes=200 nthreads=6;
  RUN;

Embodiments are not limited to this example.

FIG. 19 illustrates an operational environment 1900. The operational environment 1900 shows operations for the code generator component 122-3 to generate interpolation code to interpolate statistics for a statistical test 114.

The simulated data component 122-1 may generate simulated data 330 for a statistical test 114, the statistics of the statistical test 114 based on parameter vectors to follow a probability distribution of a known or unknown form. The statistic simulator component 122-2 may generate simulated statistics 430 for the parameter vectors from the simulated data 330, each parameter vector to comprise a single point in a grid of points. The code generator component 122-3 may remove selected points from the grid of points to form a subset of points, and generate interpolation code to interpolate a statistic of the statistical test 114 on any point.

As shown in FIG. 19, the code generator component 122-3 may receive a simulated data structure 1910. The simulated data structure 1910 may include information for a set of fitted CDF curves for the grid of points, as described with reference to FIG. 18. The code generator component 122-3 may include an interpolation code generator 1920 to execute an interpolation function 1922.

In various embodiments, the interpolation code generator 1920 may generate interpolation source code 1930 from the simulated data structure 1910 and a pair of interpolation functions 1922, 1924.

The first interpolation function 1922 may be arranged to call a secondinterpolation function comprising an instance of the virtual softwareclass. The interpolation function 1922 may be an instance of a basesoftware class 522-1 designed to call an instance of a virtual softwareclass 522-2, where the base software class 522-1 is a child of thevirtual software class 522-2. In one embodiment, for example, a basesoftware class 522-1 may be implemented as a TK-extension class forinterpolating statistics of the statistical test 114, and a virtualsoftware class 522-2 may be implemented as a virtual TK-extension class(TKICDF). Embodiments, however, are not limited to this example.

The second interpolation 1924 may be an instance of the virtual softwareclass 522-2. In one embodiment, the interpolation function 1924 mayimplement a monotonic cubic spline interpolation technique. In oneembodiment, the interpolation function 1924 may implement a monotoniccubic spline interpolation technique in combination with a transform,such as the beta transformation, for example. The beta transformationmay comprise a transform with a normalized incomplete beta function (thecumulative distribution function of beta distribution), the normalizedincomplete beta function to comprise a nonnegative function whosederivative is completely positive.

Alternatively, the interpolation code generator 1920 may utilize a single interpolation function with some or all of the characteristics of both interpolation functions 1922, 1924. Embodiments are not limited in this context.

In some cases, the interpolation code generator 1920 may have an integrated compiler 1932. The interpolation code generator 1920 may generate the interpolation source code 1930, and use the compiler 1932 to compile the interpolation source code 1930 in order to generate an interpolation executable code 1940. Alternatively, the compiler 1932 may be separate from the code generator component 122-3 (e.g., part of an operating system).

In one embodiment, the interpolation code generator 1920 may generate the interpolation source code 1930 in accordance with the following exemplary procedure:

  PROC HPSIMULATE
    data=scbpFitCDFCurves /* output of fitted CDF curves */
    datadist=(ROUNDROBIN);
   MODULE name=getCcode
    ext=tkdens /* TK Extension to plug-in */
    var=(key1 - key3 fit:) /* keys and fitting parameters */
    task=1 /* Task : Generate source code */
    taskParmN=( /*nKeys=*/3 /*bitflags=*/0 0 0 )
    taskParmS=( /*OutputPath=*/"u:\\temp",
      /*TK-ExtensionFileName=*/ "imaxf");
   OUTPUT out=scbpIndexTableMaxF;
   PERFORMANCE nnodes=0 nthreads=1;
  RUN;

Embodiments are not limited to this example.

FIG. 20 illustrates one example of a logic flow 2000. The logic flow 2000 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the code generator component 122-3 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2000 illustrates code generation operations for use with a grid of points.

As shown in FIG. 20, the logic flow 2000 may generate simulated data for a statistical test, statistics of the statistical test based on parameter vectors to follow a probability distribution, at block 2002. For instance, the simulated data component 122-1 may generate simulated data 330 for a statistical test 114, the statistical test 114 based on parameter vectors to follow a probability distribution of a known or unknown form.

The logic flow 2000 may simulate statistics for the parameter vectors from the simulated data, each parameter vector to comprise a single point in a grid of points, at block 2004. For instance, the statistic simulator component 122-2 may generate simulated statistics 430 for the parameter vectors from the simulated data 330, each parameter vector to comprise a single point in a grid of points.

The logic flow 2000 may remove selective points from the grid of points to form a subset of points at block 2006. For instance, the code generator component 122-3 may remove selective points from the grid of points to form a subset of points. The code generator component 122-3 may receive a simulated data structure 1910 with information for estimated CDF curves of the subset of points.

The logic flow 2000 may generate interpolation code to interpolate a statistic of the statistical test on any point at block 2008. For instance, the code generator component 122-3 may generate interpolation source code 1930 or interpolation executable code 1940 to interpolate a statistic of the statistical test 114 on any point in the grid of points to form an estimated CDF curve. The interpolation code may include, among other types of information, the simulated data structure 1910, index tables for the simulated data structure 1910, and a first interpolation function 1922 designed to call a second interpolation function 1924.

The interpolation source code 1930 may be used to interpolate a CDF for any given point p for a statistical test 114. Assume the simulation subsystem 120 is executed to simulate and fit CDFs on M points. Those M points construct a grid (or mesh), which is contained in the interpolation source code 1930 as generated by the code generator component 122-3 of the simulation subsystem 120. The compiler 1932 may compile the interpolation source code 1930 into interpolation executable code 1940, such as a DLL, for example. The DLL may be used to interpolate a CDF for any given point p of the statistical test, regardless of whether p is a point within the grid of points M or outside of the grid of points M.

FIG. 21A illustrates an operational environment 2100. The operational environment 2100 shows operations for the code generator component 122-3 to generate a computational representation 130 for a statistical test 114.

As shown in FIG. 21A, the code generator component 122-3 may include a CDF code generator 2120. The CDF code generator 2120 may receive a simulated data structure 1910 and interpolation source code 1930 from the interpolation code generator 1920. The simulated data structure 1910 and the interpolation source code 1930 may be integrated or separate from each other. The simulated data structure 1910 may include information for a set of fitted CDF curves for the grid of points, as described with reference to FIG. 18. The interpolation source code 1930 may interpolate a statistic of the statistical test 114 on any point.

The CDF code generator 2120 may create a computational representation 130 arranged to generate an approximate probability distribution 132 for each point in the grid of points from the simulated data structure 1910. For instance, the CDF code generator 2120 may generate CDF source code 2130 and/or CDF executable code 2140 via the compiler 2132. The compiler 2132 may be integrated with, or separate from, the CDF code generator 2120. The computational representation 130 may include the interpolation source code 1930. The computational representation 130 may also include a set of H files, data C files, function C files, and a build script.

FIG. 21B illustrates one example of a logic flow 2150. The logic flow 2150 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the CDF code generator 2120 of the code generator component 122-3 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2150 illustrates code generation operations to generate a computational representation 130.

As shown in FIG. 21B, the logic flow 2150 may receive a simulated data structure 1910 with information for a set of fitted CDF curves for the grid of points as input 2160. A process 2170 may generate source code for a computational representation 130, as implemented in generating source code 2172 by incorporating template files, data, and instructions into the corresponding type of files. For instance, the CDF code generator 2120 may generate CDF source code 2130 with the simulated data structure 1910 and interpolation source code 1930. The logic flow 2150 may output various types of source code files and logic as output 2180. For instance, the CDF code generator 2120 may generate source code files for CDF source code 2130.

The CDF source code 2130 may include, for example, one or more H files 2182. An H file 2182 may contain data structures and interface functions for the usage of a set of data and the interpolation based on the set of data. The CDF source code 2130 may include, for example, one or more data C files 2184. A data C file 2184 may contain all fitted CDF curves saved in a data structure and functions for using such a data structure. The CDF source code 2130 may include, for example, one or more function C files 2186. A function C file 2186 may contain a function for the interpolation based on a given set of data, such as data in the simulated data structure 1910, for example, the set of fitted CDF curves.

The CDF source code 2130 may also include logic implemented in the form of one or more scripts 2188. For instance, the CDF source code 2130 may include a build script or make file that specifies how to build a software library.

FIG. 22 illustrates an operational environment 2200. The operational environment 2200 shows operations for the evaluation component 122-4 to reduce a data storage size for a computational representation 130.

As shown in FIG. 22, the evaluation component 122-4 may comprise a data reduction generator 2220. The data reduction generator 2220 may receive as input a computational representation 130 arranged to generate an approximate probability distribution 132 for each point in a grid of points from simulated statistics 430 for a statistical test 114. The computational representation 130 may include a simulated data structure 1910 with information for estimated CDF curves.

The data reduction generator 2220 may evaluate the simulated data structure 1910 to determine whether any points in the grid of points are removable from the simulated data structure 1910 given a target level of precision. The data reduction generator 2220 may reduce the simulated data structure 1910 in accordance with the evaluation to produce a reduced simulated data structure 2210. The reduced simulated data structure 2210 may reduce a data storage size for the computational representation 130.

The data reduction generator 2220 may implement a parallel adaptive grid enhancement (PAGE) function 2222 arranged to implement a PAGE algorithm. In one embodiment, the data reduction generator 2220 may receive selection of a precision parameter to represent a target level of precision for the simulated data structure 1910. The data reduction generator 2220 may remove points from the simulated data structure 1910 in accordance with the selected level of precision utilizing the PAGE algorithm. The PAGE algorithm may be described in more detail with reference to FIGS. 24-27, infra.

FIG. 23 illustrates one example of a logic flow 2300. The logic flow 2300 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the data reduction generator 2220 of the evaluation component 122-4 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2300 illustrates data reduction operations to reduce a data storage size for a computational representation 130.

As shown in FIG. 23, the logic flow 2300 may receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test, the computational representation to include a simulated data structure with information for estimated cumulative distribution function (CDF) curves for one or more parameter vectors of the statistical test, each parameter vector to comprise a single point in a grid of points, at block 2302. For instance, the data reduction generator 2220 may receive as input a computational representation 130 arranged to generate an approximate probability distribution 132 for each point in a grid of points from simulated statistics 430 for a statistical test 114. The computational representation 130 may include a simulated data structure 1910 with information for estimated CDF curves.

The logic flow 2300 may evaluate the simulated data structure to determine whether any points in the grid of points are removable from the simulated data structure given a target level of precision at block 2304. For example, the data reduction generator 2220 may evaluate the simulated data structure 1910 to determine whether any points in the grid of points are removable from the simulated data structure 1910 given a target level of precision.

The logic flow 2300 may reduce the simulated data structure in accordance with the evaluation to produce a reduced simulated data structure having a smaller data storage size relative to the simulated data structure, the reduced simulated data structure to reduce a data storage size for the computational representation at block 2306. For example, the data reduction generator 2220 may reduce the simulated data structure 1910 in accordance with the evaluation to produce a reduced simulated data structure 2210, where the reduced simulated data structure 2210 has a smaller data storage size as compared to the simulated data structure 1910. The reduced simulated data structure 2210 may in turn reduce a data storage size for the computational representation 130.

FIG. 24 illustrates one example of a logic flow 2400. The logic flow 2400 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the data reduction generator 2220 of the evaluation component 122-4 of the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2400 illustrates data reduction operations to reduce a data storage size for a computational representation 130 utilizing a PAGE algorithm.

In general, the logic flow 2400 may receive a computational representation 130 with a simulated data structure 1910 containing information for estimated CDF curves, and evaluate the simulated data structure 1910 to determine whether any points in the grid of points are removable from the simulated data structure given a target level of precision. The logic flow 2400 may perform the evaluation using a PAGE algorithm. The logic flow 2400 may then reduce the simulated data structure 1910 using evaluation results to produce a reduced simulated data structure 2210.

As shown in FIG. 24, the logic flow 2400 may receive various inputs for a PAGE algorithm, such as an interpolation grid G₀ with M points at 2402, an interpolation grid G₂ with N points at 2404, and an input table of N rows at 2406. Each row of the input table may contain K keys and Q quantiles. The interpolation grid G₀ and/or the interpolation grid G₂ may be examples of an interpolation executable code 1940. The input table at 2406 may be an example of a simulated data structure 1910.

The logic flow 2400 may receive selection of a precision parameter to represent a target level of precision for the simulated data structure. The precision parameter may be automatically selected by the data reduction generator 2220 based on a defined set of rules. Alternatively, the precision parameter may be selected by a user. Once selected, the PAGE algorithm may receive as input the precision parameter, along with other control parameters, for example, the type of interpolation method, as indicated at 2408.

The logic flow 2400 may remove points from the simulated data structure in accordance with a selected level of precision utilizing the PAGE algorithm. The PAGE algorithm may be used to identify a set of candidate points for potential removal from a simulated data structure. In one embodiment, for instance, the PAGE algorithm may execute at 2410 and output a candidate reduction data set using the interpolation grids G₀, G₂, the input table, and the one or more control parameters. The candidate reduction data set may be stored in a first output table 1 as indicated at 2412. The output table 1 may include evaluation information. The evaluation information may include, for example, a defined number of rows N, with each row to include one or more each of K keys, Q explanation errors on quantiles, one or more evaluation criteria, F fit parameters, and/or one or more flags to indicate if a point p is to remain in an interpolation grid G₁.

The logic flow 2400 may perform a DATA operation 2414 to extract one or more rows from the output table 1 at 2412 based on the evaluation information to construct a second output table 2 at 2416. For instance, output table 2 is a subset of output table 1, and it contains the rows that should be included in the interpolation grid G₁ and columns of keys and fit parameters. Output table 2 may be an example of a reduced simulated data structure 2210. The logic flow 2400 may utilize the code generator component 122-3 at 2418 to generate the interpolation grid G₁ at 2420 based on the output table 2 at 2416. The interpolation grid G₁ may be an example of an interpolation executable code 1940.

In one embodiment, the PAGE algorithm may be arranged to generate the candidate reduction data set using a “jackknife” evaluation technique. A jackknife evaluation technique provides information regarding whether a point may be approximated by its neighbors for a given level of precision. This information may be used to determine those points that cannot be removed from the grid of points for the given level of precision. Once needed points are identified, the remaining points may be stored in the candidate reduction data set. For instance, the jackknife operation may provide information on a relationship between precision and grid size. Table 1 illustrates results from a jackknife evaluation technique on all 103,780 points on the grid of points, with each point having 10,001 quantiles, for a maxF test:

TABLE 1
Quantile    Jackknife Result
100%        0.445721510
 99%        0.007458065
 95%        0.000650852
 90%        0.000596543
 75%        0.000532891
 50%        0.000477936
 25%        0.000435499
 10%        0.000401377
  5%        0.000382148
  1%        0.000346780
  0%        0.000270918

Table 1 illustrates that fewer than 1% of the points cannot be explained well by their neighbors when the precision requirement is 0.0075.
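The jackknife idea described above may be sketched in Python as follows. This is a simplified illustration under stated assumptions: the grid has a single key, neighbors are the immediately adjacent points, and the prediction is a linear interpolation of each quantile. The function name and the sample data are hypothetical and do not come from the evaluation module.

```python
def jackknife_errors(keys, quantiles):
    """Leave-one-out error for each interior point of a 1-D grid.

    keys:      strictly increasing key values, one per grid point.
    quantiles: one list of quantile values per grid point.
    Returns {key: largest absolute error over the point's quantiles} when
    the point is predicted from its two neighbors alone.
    """
    errors = {}
    for i in range(1, len(keys) - 1):
        # Relative position of the left-out point between its neighbors.
        w = (keys[i] - keys[i - 1]) / (keys[i + 1] - keys[i - 1])
        errors[keys[i]] = max(
            abs((1.0 - w) * lo + w * hi - actual)
            for lo, actual, hi in zip(quantiles[i - 1], quantiles[i],
                                      quantiles[i + 1])
        )
    return errors

# Hypothetical data: two quantiles per point; the point at key 2 deviates.
keys = [0.0, 1.0, 2.0, 3.0]
quantiles = [[0.0, 0.0], [1.0, 2.0], [2.5, 4.0], [3.0, 6.0]]
errors = jackknife_errors(keys, quantiles)
must_keep = [k for k, e in errors.items() if e > 0.3]  # precision 0.3 is arbitrary
```

Points whose error exceeds the chosen precision cannot be explained by their neighbors and would therefore have to remain in the grid.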

In one embodiment, a jackknife evaluation technique may be performed in accordance with the following exemplary procedure:

  PROC HPSIMULATE
    data=scbpQuantiles /* output of quantiles */
    datadist=(ROUNDROBIN);
   MODULE name=evaluation
    ext=tkdens /* TK Extension to plug-in */
    var=(key1 - key3 q0 - q10000) /* keys and quantiles */
    task=2 /* Task : Evaluate performance */
    taskParmN=( /*nKeys=*/3 /*EvalType=*/1 /*weightTails=*/0
      /*weightA=*/-4.605 /*weightB=*/5.685
      /*interpolationMethod=*/1 /*interpolationMethodParm=*/5 )
    taskParmS=( /*tkExtension=*/ "imaxf");
   OUTPUT out=scbpEvaluationJackknife;
   PERFORMANCE nnodes=200 nthreads=6;
  RUN;

Embodiments are not limited to this example.

The PAGE algorithm may use results from the jackknife evaluation technique as a basis for selectively removing points from the grid of points, estimating an approximation error for interpolation, and storing the removed points in the candidate reduction data set based on the approximation error. The PAGE algorithm may then evaluate each point in the candidate reduction data set against a set of evaluation criteria until a precision parameter is satisfied.

In general, the PAGE algorithm determines, given some target level of precision, whether an original interpolation grid G₂ could be reduced into a smaller interpolation grid G₁, without deleting any points from an interpolation grid G₀. The smaller interpolation grid may result in a smaller data storage size for the computational representation 130 (e.g., DLL). An example for reducing a data storage size for the computational representation 130 may be illustrated with the following exemplary procedure:

  PROC HPSIMULATE
    data=scbpQuantiles /* output of quantiles */
    datadist=(ROUNDROBIN);
   MODULE name=PAGE
    ext=tkdens /* TK Extension to plug-in */
    dependent
    var=(key1 - key3 q0 - q10000) /* keys and quantiles */
    task=3 /* Task : Shrink the DLL size */
    taskParmN=( /*targetPrecision=*/0.0007 )
    taskParmS=( /*G2 tkExtension=*/ "imaxf"
      /*G0 tkExtension=*/ "imaxf0" );
   OUTPUT out=scbpPAGE_G1;
   PERFORMANCE nnodes=200 nthreads=6;
  RUN;

Embodiments are not limited to this example.

After using a PAGE algorithm according to different precisions, a grid size with corresponding levels of precision for the maxF test may be shown in Table 2 as follows:

TABLE 2
Precision               0.0050   0.0025   0.0010   0.0007   0.0005
Grid Size (# Points)    7,868    9,778    13,766   17,202   103,780
% of Original Grid      7.6%     9.4%     13.3%    16.6%    100.0%

Note that the original grid (e.g., simulated data structure 1910) had 103,780 points for a precision level of 0.0005 (≥ max|· − T̃|). As indicated by Table 2, a data storage size for the simulated data structure 1910 may be substantially reduced when a level of precision is reduced. For instance, at a precision level of 0.0050, the number of points may be reduced from 103,780 points to 7,868 points, which is 7.6% of the simulated data structure 1910. In this manner, an informed design decision may be made for the interpolation source code 1930 and/or the computational representation 130 regarding tradeoffs between a level of precision and data storage size, as desired for a given implementation. Embodiments are not limited in this context.
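The percentages in Table 2 follow directly from the grid sizes. The short Python check below recomputes them; the helper name and the dictionary layout are assumptions made for illustration.

```python
def percent_of_original(grid_size, original_size):
    """Grid size as a percentage of the original grid, to one decimal."""
    return round(100.0 * grid_size / original_size, 1)

# Grid sizes from Table 2, keyed by precision level.
TABLE_2_GRID_SIZES = {
    0.0050: 7868,
    0.0025: 9778,
    0.0010: 13766,
    0.0007: 17202,
    0.0005: 103780,
}

ORIGINAL_SIZE = TABLE_2_GRID_SIZES[0.0005]
percentages = {
    precision: percent_of_original(size, ORIGINAL_SIZE)
    for precision, size in TABLE_2_GRID_SIZES.items()
}
```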

In some cases, it may take significant time and computational resources to simulate all points with an original set of statistics (e.g., 1 million statistics for the maxF test). To reduce time and conserve computational resources, a reduced number of statistics (e.g., 20,000 statistics for the maxF test) could be used for a single point, and then the PAGE algorithm may be used on the simulated points to find final grid points. The original set of statistics (e.g., 1,000,000) may then be simulated for only the final grid points. This could be accomplished using a defined set of criteria.

For the maxF test, for example, 20,000 statistics on each of 103,780 points may be simulated, and 10,001 quantiles on each of 103,780 points may be generated. Assume CDFs are fitted with a precision of 0.0020. The average number of curve parameters for different precisions is shown in Table 3, as follows:

TABLE 3
Precision                      0.0050   0.0025   0.0020   0.0010
Avg. # of curve Parameters     7.261    12.081   18.877   109.592

Code and a DLL may be generated, and the PAGE algorithm may be applied to the DLL to generate Table 4, as follows:

TABLE 4
Precision               0.0050   0.0045   0.0040   0.0035   0.0030   0.0025   0.0020
Percentage of Points    10.6%    12.2%    14.9%    19.2%    27.3%    46.1%    87.7%

Using the results shown in Table 4, assume the points corresponding to a precision of 0.0030 are selected. The original set of statistics (e.g., 1,000,000 statistics) may be simulated on each of the selected points. The defined number of quantiles (e.g., 10,001 quantiles) on each of the selected points may be generated. The CDFs may be fitted with a precision of 0.0005. Finally, code and a DLL may be generated for the selected points.

Since all points with 1,000,000 statistics are available, the PAGE algorithm can do another evaluation, the results of which are shown in Table 5 as follows:

TABLE 5
Quantile       Estimates
100% Max       0.002834907
 99%           0.000847933
 95%           0.000661086
 90%           0.000603617
 75% Q3        0.000530835
 50% Median    0.000479984
 25% Q1        0.000442317
 10%           0.000411247
  5%           0.000394015
  1%           0.000361853
  0% Min       0.000265525

Various aspects of the evaluation component 122-4 in general, and the data reduction generator 2220 and PAGE algorithm in particular, may be described with reference to FIGS. 25-27, infra.

FIG. 25 illustrates one example of a logic flow 2500. The logic flow 2500 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the data reduction generator 2220 of the evaluation component 122-4 of the simulation subsystem 120 of the automated statistical test system 100.

The logic flow 2500 illustrates evaluation operations performed in accordance with an exemplary PAGE algorithm. In general, the PAGE algorithm determines, given some target level of precision, whether an original interpolation grid G₂ could be reduced into a smaller interpolation grid G₁, without deleting any points from an interpolation grid G₀. In this example, the PAGE algorithm is implemented by the distributed computing system 610 utilizing a general/captain architecture.

As shown in FIG. 25, the logic flow 2500 may initialize an output table on a captain node at block 2502. The output table may store a candidate reduction data set. The logic flow 2500 may perform a jackknife operation on interpolation grid G₂ with N points to find the P points not meeting the control parameters at 2504.

The logic flow 2500 may call a subroutine MPI_Allgatherv for execution by a general node and the captain node at block 2506. The logic flow 2500 may form an interpolation grid G₁ and update flags at 2508. The interpolation grid G₁ may include the interpolation grid G₀ plus P points.

The logic flow 2500 may interpolate all quantiles through the interpolation grid G₁ against a set of evaluation criteria until the precision parameter is satisfied. For instance, the logic flow 2500 may evaluate N points on the interpolation grid G₁ at 2510. The logic flow 2500 may call subroutines MPI_Reduce and MPI_Bcast on the general node and/or the captain node to broadcast a maximum criterion and the points V that achieve the maximum criterion at 2512. The logic flow 2500 may test whether the maximum criterion is less than or equal to a defined precision level at 2514. If the maximum criterion is less than or equal to the defined precision level, then the general node may call the subroutine MPI_Bcast to indicate a parameter qDONE is set to a value of 1 at 2516. The PAGE algorithm then terminates.

If the maximum criterion is greater than the defined precision level, then the general node and/or the captain node may call the subroutine MPI_Bcast to indicate a parameter qDONE is set to a value of 0 and the point V at 2518. The captain node may update the interpolation grid G₁ to include the interpolation grid G₁ plus the points V and update the flag at 2520. Operations at 2510, 2512, 2514, 2518 and 2520 may be repeated until the maximum criterion is less than or equal to a defined precision level at 2514. The PAGE algorithm then terminates.
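The iteration of the logic flow 2500 may be sketched in Python as a single-process loop. This is a simplified illustration, not the distributed implementation: it assumes a one-key grid, one scalar value per point instead of Q quantiles, linear interpolation, and an absolute-error criterion, and it replaces the MPI exchange between general and captain nodes with ordinary function calls. The names page_reduce and interp_through are hypothetical.

```python
import bisect

def interp_through(grid_keys, grid_vals, x):
    """Linearly interpolate at x through a sorted grid of at least 2 points."""
    j = min(max(bisect.bisect_right(grid_keys, x) - 1, 0), len(grid_keys) - 2)
    w = (x - grid_keys[j]) / (grid_keys[j + 1] - grid_keys[j])
    return (1.0 - w) * grid_vals[j] + w * grid_vals[j + 1]

def page_reduce(keys, vals, g0, precision):
    """Return sorted indices of a reduced grid G1 with G0 <= G1 <= G2.

    keys, vals: the full grid G2 (keys strictly increasing).
    g0:         indices that must never be deleted.
    Assumption of this sketch: both end points are always kept so that
    every point of G2 can be interpolated through G1.
    """
    n = len(keys)
    keep = set(g0) | {0, n - 1}

    # Jackknife pass: keep the points their neighbors cannot explain.
    for i in range(1, n - 1):
        pred = interp_through([keys[i - 1], keys[i + 1]],
                              [vals[i - 1], vals[i + 1]], keys[i])
        if abs(pred - vals[i]) > precision:
            keep.add(i)

    # Refinement loop: add the worst point until the criterion is satisfied.
    while True:
        idx = sorted(keep)
        gk = [keys[i] for i in idx]
        gv = [vals[i] for i in idx]
        worst_err, worst_i = max(
            (abs(interp_through(gk, gv, keys[i]) - vals[i]), i)
            for i in range(n)
        )
        if worst_err <= precision:   # corresponds to qDONE = 1
            return idx
        keep.add(worst_i)            # corresponds to G1 = G1 plus V

# Hypothetical grid: 11 points on a smooth curve, target precision 0.05.
keys = [float(k) for k in range(11)]
vals = [k * k / 100.0 for k in keys]
g1 = page_reduce(keys, vals, g0=[0, 10], precision=0.05)
```

The loop always terminates because a point already in the grid interpolates itself exactly, so each pass that fails the test adds a new point, and the full grid G₂ trivially satisfies any nonnegative precision.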

FIG. 26 illustrates one example of a logic flow 2600. The logic flow 2600 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2600 illustrates a procedure for the simulation subsystem 120 to generate a computational representation 130.

As shown in FIG. 26, the logic flow 2600 may simulate statistics by repeating, for p equals 1 to P, simulating S statistics on point p, where S is set to 20,000 and P equals a number of all potential points (or parameter vectors), at block 2602. Block 2602 may output S by P statistics at 2614.

The logic flow 2600 may generate quantiles by repeating, for p equals 1 to P, generating Q quantiles on point p, where Q is set to 10,001, at block 2604. Block 2604 may output Q by P quantiles at 2616.
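As a rough illustration of quantile generation for a single point, the following Python sketch reduces S simulated statistics to Q evenly spaced empirical quantiles. The quantile definition used here (linear interpolation between order statistics) and the function name are assumptions of the sketch; the text does not specify which definition is used.

```python
def empirical_quantiles(statistics, q):
    """Return q quantiles of the sample at levels 0, 1/(q-1), ..., 1.

    Assumes q >= 2 and a non-empty sample. Linear interpolation between
    order statistics is an assumption of this sketch.
    """
    s = sorted(statistics)
    n = len(s)
    out = []
    for i in range(q):
        pos = i * (n - 1) / (q - 1)   # fractional index into the sorted sample
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(s[lo] + frac * (s[hi] - s[lo]))
    return out

# Tiny hypothetical sample; in the text S is 20,000 and Q is 10,001.
sample = [4.0, 0.0, 2.0, 1.0, 3.0]
three_quantiles = empirical_quantiles(sample, 3)   # minimum, median, maximum
```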

The logic flow 2600 may fit CDFs by repeating, for p equals 1 to P, fitting a curve to Q quantiles on point p with at most F curve parameters, where F is set to 128, at block 2606. Block 2606 may output F by P curve parameters at 2618.

The logic flow 2600 may generate C code using all P points for grid G₂ and selected points for grid G₀ at block 2608. Block 2608 may output two C files, four H files and two build scripts, at 2620.

The logic flow 2600 may build a TK-Extension using a SDSGUI to build two DLLs at block 2610. Block 2610 may output a tkGrid2.dll and a tkGrid0.dll at 2622.

The logic flow 2600 may run the PAGE algorithm for different levels of precision at block 2612. Block 2612 may output a table of number of points versus a given level of precision at 2624. Control is then passed to control location G.

FIG. 27 illustrates one example of a logic flow 2700. The logic flow 2700 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the simulation subsystem 120 of the automated statistical test system 100. More particularly, the logic flow 2700 illustrates a procedure for the simulation subsystem 120 to reduce a data storage size for a computational representation 130.

As shown in FIG. 27, the logic flow 2700 may receive control from control location G, and select a proper number of points for the computational representation 130 at 2702. The proper number of points may be selected by the data reduction generator 2220, and it may be an example of a reduced simulated data structure 2210.

The logic flow 2700 may simulate statistics by repeating, for p equals 1 to B, simulating S statistics on point p, where S is set to 1,000,000 and B equals the number of selected points (or parameter vectors), at block 2704. Block 2704 may output S by B statistics at 2714.

The logic flow 2700 may generate quantiles by repeating, for p equals 1 to B, generating Q quantiles on point p, where Q is set to 10,001, at block 2706. Block 2706 may output Q by B quantiles at 2716.

The logic flow 2700 may fit CDFs by repeating, for p equals 1 to B, fitting a curve to Q quantiles on point p with at most F curve parameters, where F is set to 128, at block 2708. Block 2708 may output F by B curve parameters at 2718.

The logic flow 2700 may generate C code using all B points for grid G₁ at block 2710. Block 2710 may output one C file, two H files and one build script, at 2720.

The logic flow 2700 may build a TK-Extension using a SDSGUI to build one DLL at block 2712. Block 2712 may output a tkGrid1.dll at 2722. The tkGrid1.dll may be an example of an interpolation executable code 1940.

FIG. 28A illustrates a block diagram for a statistical test subsystem 140. The statistical test subsystem 140 is part of the automated statistical test system 100. The statistical test subsystem 140 may, for example, generate statistical significance values for results of a statistical test using an approximate probability distribution.

As shown in FIG. 28A, the statistical test subsystem 140 may include a statistical test application 2820 having various components 2822-s. The statistical test application 2820 may include a data handler component 2822-1, a statistical test component 2822-2, and a significance generator component 2822-3. The statistical test application 2820 may include more or fewer components 2822-s for other implementations.

The data handler component 2822-1 may be generally arranged to handle data sets for use in a statistical test 114. For instance, the data handler component 2822-1 may receive a real data set 2810 from a client device 602. The real data set 2810 may represent actual data for analysis by the statistical test 114, such as sets of collected business or enterprise data, as opposed to simulated data 330 used to generate approximate probability distributions 132 for the statistical test 114. In one embodiment, for example, the real data set 2810 may comprise data representing one or more physical phenomena, such as occurrences of heads or tails in a coin flip, sales of a number of shoes in Asia, or a percentage increase or decrease in a financial portfolio. In one embodiment, for example, the real data set 2810 may comprise data representing one or more measurable phenomena, which may include both physical and non-physical phenomena. An example of non-physical phenomena may include without limitation digital data from an electronic device, such as a sensor, computer, or characters on a display. Embodiments are not limited in this context.

The statistical test component 2822-2 may be generally arranged to perform the statistical test using the real data set 2810. The statistical test component 2822-2 may receive a computational representation 130 from, for example, the simulation subsystem 120. The statistical test component 2822-2 may also receive the statistical test function 112 for the statistical test 114. As previously described, the computational representation 130 may be arranged to generate an approximate probability distribution 132 for each point in a grid of points from simulated statistics 430 for the statistical test 114, statistics of the statistical test 114 to follow a probability distribution of a known or unknown form. The approximate probability distribution 132 may comprise an empirical CDF, the empirical CDF to have a first level of precision relative to the probability distribution of the known or unknown form based on a sample size of the simulated statistics.

The statistical test component 2822-2 may generate a set of statistics 2824 for the statistical test 114 using the real data set 2810 and the statistical test function 112.

The significance generator component 2822-3 may be generally arranged to generate a set of statistical significance values 2830 for the statistics 2824 generated by the statistical test component 2822-2 using the approximate probability distribution 132 of the computational representation 130. The set of statistical significance values may be in the form of one or more p-values.
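One way such a significance value might be obtained through interpolation is sketched below in Python. The sketch assumes the approximate distribution is available as a sorted table of Q quantiles at evenly spaced probability levels, that the CDF is interpolated linearly between adjacent quantiles, and that the test is upper-tailed so the p-value is one minus the CDF. The function name and the sample table are hypothetical.

```python
import bisect

def p_value_from_quantiles(quantiles, statistic):
    """Upper-tail p-value of a statistic under a tabulated distribution.

    quantiles: nondecreasing values; entry i is the quantile at
               probability level i / (len(quantiles) - 1).
    """
    q = len(quantiles)
    if statistic <= quantiles[0]:
        return 1.0
    if statistic >= quantiles[-1]:
        return 0.0
    # Locate j with quantiles[j] <= statistic < quantiles[j + 1].
    j = bisect.bisect_right(quantiles, statistic) - 1
    lo, hi = quantiles[j], quantiles[j + 1]
    frac = (statistic - lo) / (hi - lo)
    cdf = (j + frac) / (q - 1)
    return 1.0 - cdf

# Hypothetical 5-quantile table (levels 0, 0.25, 0.5, 0.75, 1).
table = [0.0, 1.0, 2.0, 3.0, 4.0]
p = p_value_from_quantiles(table, 3.5)
```

With a coarse table the p-value resolution is limited by the spacing of the probability levels, which is one reason a large number of quantiles per point may be retained.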

A p-value may generally represent a probability of obtaining a given test statistic from observed or measurable data, such as a test statistic obtained or evaluated from the real data set 2810. More particularly, a p-value may represent a probability of obtaining a test statistic evaluated from the real data set 2810 that is at least as “extreme” as one that was actually observed, assuming the null hypothesis is true. For instance, assume a statistical test 114 involves rolling a pair of dice once and further assume a null hypothesis that the dice are fair. An exemplary test statistic may comprise “the sum of the rolled numbers,” and the test is one-tailed. When the dice are rolled, assume a result where each die lands showing the number 6. In this case, the test statistic is the sum of the rolled numbers from both dice, which would be 12 (6+6=12). A p-value for this particular outcome is a probability of 1/36, or approximately 0.028, because a sum of 12 is the highest possible test statistic and arises from only one of the 6×6=36 equally likely outcomes. If a significance level of 0.05 is assumed, then this result would be deemed significant since 0.028 is lower than 0.05. As such, the observed result of 12 from the rolled dice would amount to evidence that could be used to reject the null hypothesis that the dice are fair.
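The dice example can be checked by enumerating all outcomes. The following sketch is illustrative only; the function name `dice_sum_p_value` is an assumption for this example, and the one-tailed test is taken to mean "a sum at least as large as the observed sum."

```python
from fractions import Fraction
from itertools import product


def dice_sum_p_value(observed_sum):
    """Exact one-tailed p-value for the sum of two fair dice: the share of the
    36 equally likely outcomes whose sum is at least the observed sum."""
    outcomes = list(product(range(1, 7), repeat=2))
    extreme = sum(1 for a, b in outcomes if a + b >= observed_sum)
    return Fraction(extreme, len(outcomes))


if __name__ == "__main__":
    p = dice_sum_p_value(12)
    print(p, float(p) < 0.05)
```

Using exact fractions avoids rounding when the result is compared with the significance level.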

Once p-values are generated, the significance generator component 2822-3 may use the p-values in a number of different ways. For instance, the significance generator component 2822-3 may present the p-values in a user interface view on an electronic display, an example of which is described with reference to FIG. 28B, infra. A user may then determine whether a null hypothesis for the statistical test 114 is rejected based on the p-values.

Additionally or alternatively, this determination may be automatically made by the statistical application 2820. For instance, the significance generator component 2822-3 may compare a p-value to a defined threshold value. The significance generator component 2822-3 may then determine whether a null hypothesis for the statistical test 114 is rejected based on the comparison. The significance generator component 2822-3 may then display a conclusion from the results on the electronic display.
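A minimal sketch of such an automated determination is shown below. The function name `reject_null`, the default threshold of 0.05, and the strict "less than" comparison are assumptions for illustration; an implementation may define the comparison differently.

```python
def reject_null(p_value, threshold=0.05):
    """Return True when the null hypothesis would be rejected, i.e. when the
    p-value falls strictly below the defined threshold value (assumed rule)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    return p_value < threshold


if __name__ == "__main__":
    # The dice example above: p = 1/36 compared with a 0.05 threshold.
    print("reject H0" if reject_null(1 / 36) else "fail to reject H0")
```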

FIG. 28B illustrates a user interface view 2850. The user interface view 2850 illustrates an exemplary user interface presenting output of a statistical test 114 in the form of Bai and Perron's multiple structural change test as executed by the statistical test application 2820.

This example illustrates how to use Bai and Perron's multiple structural change tests and the p-values generated from a HPSIMULATE procedure. It uses the following notations:

t: a time index

y: a dependent variable

x: an independent variable

ε: an innovation

i.i.d.: independent and identically distributed

N(0,1): a standard normal distribution with mean 0 and variance 1

H₀: a null hypothesis

H₁: an alternative hypothesis

m: a number of break points in the data

supF_(l+1|l): a sequential test for multiple structural change proposed by Bai and Perron, where l is the number of break points in the null hypothesis and l+1 in the alternative hypothesis

As shown in a DATA operation 2852, labeled in the user interface view as “data one,” the data generating process (DGP) has two break points at time indices 60 and 140. Precisely, the structural change model is as follows:

$y_{t} = \begin{cases} 2 + x_{t} + \varepsilon_{t}, & t \leq 59 \\ 3 + 2x_{t} + \varepsilon_{t}, & 60 \leq t \leq 139 \\ 3 + 2.9x_{t} + \varepsilon_{t}, & t \geq 140 \end{cases} \qquad \varepsilon_{t} \sim \text{i.i.d. } N(0,1)$
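For illustration, data following this structural change model might be simulated as sketched below. The sample size of 200, the standard normal draw for x_t, the seed, and the function names are assumptions made for the example; the source specifies only the regime coefficients, the break points, and the distribution of the innovation.

```python
import random


def regime_coefficients(t):
    """Intercept and slope of the model for time index t (break points at 60 and 140)."""
    if t <= 59:
        return 2.0, 1.0
    if t <= 139:
        return 3.0, 2.0
    return 3.0, 2.9


def simulate_structural_change(n=200, seed=0):
    """Return a list of (t, x_t, y_t) tuples for t = 1..n.

    x_t is drawn from N(0,1) purely as an assumption; eps_t is i.i.d. N(0,1)
    as stated in the model.
    """
    rng = random.Random(seed)
    rows = []
    for t in range(1, n + 1):
        x = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        intercept, slope = regime_coefficients(t)
        rows.append((t, x, intercept + slope * x + eps))
    return rows


if __name__ == "__main__":
    for row in simulate_structural_change()[:3]:
        print(row)
```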

In a PROC operation 2854, labeled in the user interface view 2850 as “proc autoreg,” a BP=(M=3) option is set in the AUTOREG procedure to apply Bai and Perron's multiple structural change tests on the data. The user interface view 2850 shows the result of supF_(l+1|l) tests in a table 2856 annotated as “Bai and Perron's Multiple Structural Change Tests, supF(l+1|l) Tests,” which sequentially checks the null hypothesis H₀: m=l versus the alternative hypothesis H₁: m=l+1 for l=0, 1, 2, 3, where m is the number of break points in the data. A statistic for each test is shown in a column 2858 and a corresponding p-value, interpolated from the DLL generated by the HPSIMULATE procedure, is shown in a column 2860. If 15% is selected as a defined threshold value (e.g., a significance threshold), then comparing the p-values to 15% leads to rejection of the null hypotheses H₀: m=0 and H₀: m=1. However, the null hypothesis H₀: m=2 cannot be rejected. According to one interpretation of these tests, there exist at least 2 break points in the data.
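One way to read the sequential procedure is as a loop that stops at the first null hypothesis that cannot be rejected. The sketch below illustrates that reading only. The function name `estimate_break_count` is an assumption, and the p-values in the usage example are hypothetical placeholders, not the values shown in column 2860.

```python
def estimate_break_count(p_values, threshold):
    """Walk the sequential tests H0: m=l versus H1: m=l+1 for l = 0, 1, 2, ...

    p_values[l] is the p-value of the test with l break points under the null.
    Returns the first l whose null hypothesis is not rejected, or len(p_values)
    when every null hypothesis is rejected.
    """
    for l, p in enumerate(p_values):
        if p >= threshold:
            return l
    return len(p_values)


if __name__ == "__main__":
    hypothetical_p_values = [0.001, 0.02, 0.40]  # placeholders for l = 0, 1, 2
    print(estimate_break_count(hypothetical_p_values, threshold=0.15))
```

With these placeholder values, the nulls m=0 and m=1 are rejected at the 15% threshold and m=2 is not, which matches the pattern of conclusions described above.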

For the supF_(l+1|l) test, critical values in the literature are available for only four significance levels, namely 1%, 2.5%, 5%, and 10%, and only on some parameter vectors. Hence, a user can only make a decision at those four significance levels on that finite set of parameter vectors by comparing the test statistics, based on the real data set, with the critical values available in the literature. However, with the support of the HPSIMULATE system and the DLL generated from it, the user can make a decision at any significance level of interest (e.g., 15% here) on any parameter vector.

FIG. 29 illustrates one example of a logic flow 2900. The logic flow 2900 may be representative of some or all of the operations executed by one or more embodiments described herein, such as the statistical test subsystem 140 of the automated statistical test system 100.

As shown in FIG. 29, the logic flow 2900 may receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, statistics of the statistical test to follow a probability distribution, at block 2902. The probability distribution, for example, may comprise a probability distribution of a known or an unknown form. The logic flow 2900 may receive a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon or physical phenomenon, at block 2904. The logic flow 2900 may generate statistics for the statistical test using the real data set on the parameter vector at block 2906. The logic flow 2900 may generate the approximate probability distribution of the computational representation on the parameter vector at block 2908. The logic flow 2900 may generate a set of statistical significance values for the statistics through interpolation by using the approximate probability distribution of the computational representation, the set of statistical significance values comprising one or more p-values, each p-value to represent a probability of obtaining a given test statistic from the real data set, at block 2910.

FIG. 30 illustrates a block diagram of a centralized system 3000. The centralized system 3000 may implement some or all of the structure and/or operations for the automated statistical test system 100 in a single computing entity, such as entirely within a single device 3020.

The device 3020 may comprise any electronic device capable of receiving, processing, and sending information for the automated statistical test system 100. Examples of an electronic device may include without limitation an ultra-mobile device, a mobile device, a personal digital assistant (PDA), a mobile computing device, a smart phone, a telephone, a digital telephone, a cellular telephone, eBook readers, a handset, a one-way pager, a two-way pager, a messaging device, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a netbook computer, a handheld computer, a tablet computer, a server, a server array or server farm, a web server, a network server, an Internet server, a work station, a mini-computer, a main frame computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, multiprocessor systems, processor-based systems, consumer electronics, programmable consumer electronics, game devices, television, digital television, set top box, wireless access point, base station, subscriber station, mobile subscriber center, radio network controller, router, hub, gateway, bridge, switch, machine, or combination thereof. The embodiments are not limited in this context.

The device 3020 may execute processing operations or logic for the automated statistical test system 100 using a processing component 3030. The processing component 3030 may comprise various hardware elements, software elements, or a combination of both. Examples of hardware elements may include devices, logic devices, components, processors, microprocessors, circuits, processor circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software elements may include software components, programs, applications, computer programs, application programs, system programs, software development programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation.

The device 3020 may execute communications operations or logic for the automated statistical test system 100 using communications component 3040. The communications component 3040 may implement any well-known communications techniques and protocols, such as techniques suitable for use with packet-switched networks (e.g., public networks such as the Internet, private networks such as an enterprise intranet, and so forth), circuit-switched networks (e.g., the public switched telephone network), or a combination of packet-switched networks and circuit-switched networks (with suitable gateways and translators). The communications component 3040 may include various types of standard communication elements, such as one or more communications interfaces, network interfaces, network interface cards (NIC), radios, wireless transmitters/receivers (transceivers), wired and/or wireless communication media, physical connectors, and so forth. By way of example, and not limitation, communication media 3012, 3042 include wired communications media and wireless communications media. Examples of wired communications media may include a wire, cable, metal leads, printed circuit boards (PCB), backplanes, switch fabrics, semiconductor material, twisted-pair wire, co-axial cable, fiber optics, a propagated signal, and so forth. Examples of wireless communications media may include acoustic, radio-frequency (RF) spectrum, infrared and other wireless media.

The device 3020 may communicate with other devices 3010, 3050 over a communications media 3012, 3042, respectively, using communications information 3014, 3044, respectively, via the communications component 3040. The devices 3010, 3050 may be internal or external to the device 3020 as desired for a given implementation. An example for the devices 3010 may be one or more client devices used to access results from the automated statistical test system 100.

FIG. 31 illustrates a block diagram of a distributed system 3100. The distributed system 3100 may distribute portions of the structure and/or operations for the automated statistical test system 100 across multiple computing entities. Examples of distributed system 3100 may include without limitation a client-server architecture, a 3-tier architecture, an N-tier architecture, a tightly-coupled or clustered architecture, a peer-to-peer architecture, a master-slave architecture, a shared database architecture, and other types of distributed systems. The embodiments are not limited in this context.

The distributed system 3100 may comprise a client device 3110 and a server device 3150. In general, the client device 3110 and the server device 3150 may be the same or similar to the device 3020 as described with reference to FIG. 30. For instance, the client device 3110 and the server device 3150 may each comprise a processing component 3130 and a communications component 3140 which are the same or similar to the processing component 3030 and the communications component 3040, respectively, as described with reference to FIG. 30. In another example, the devices 3110, 3150 may communicate over a communications media 3112 using communications information 3114 via the communications components 3140.

The client device 3110 may comprise or employ one or more client programs that operate to perform various methodologies in accordance with the described embodiments. In one embodiment, for example, the client device 3110 may implement a client application 3116 to configure, control or otherwise manage the automated statistical test system 100. The client application 3116 may also be used to view results from the automated statistical test system 100, such as statistical significance values or null hypothesis results. The client application 3116 may be implemented as a thin-client specifically designed to interoperate with the automated statistical test system 100. Alternatively, the client application 3116 may be a web browser to access the automated statistical test system 100 via one or more web technologies. Embodiments are not limited in this context.

The server device 3150 may comprise or employ one or more server programs that operate to perform various methodologies in accordance with the described embodiments. In one embodiment, for example, the server device 3150 may implement the automated statistical test system 100, and any interfaces needed to permit access to the automated statistical test system 100, such as a web interface. The server device 3150 may also control authentication and authorization operations to enable secure access to the automated statistical test system 100 via the media 3112 and information 3114.

FIG. 32 illustrates an embodiment of an exemplary computing architecture 3200 suitable for implementing various embodiments as previously described. In one embodiment, the computing architecture 3200 may comprise or be implemented as part of an electronic device. Examples of an electronic device may include those described with reference to FIG. 31, among others. The embodiments are not limited in this context.

As used in this application, the terms “system” and “component” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution, examples of which are provided by the exemplary computing architecture 3200. For example, a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the uni-directional or bi-directional exchange of information. For instance, the components may communicate information in the form of information communicated over the communications media. The information can be implemented as information allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.

The computing architecture 3200 includes various common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components, power supplies, and so forth. The embodiments, however, are not limited to implementation by the computing architecture 3200.

As shown in FIG. 32, the computing architecture 3200 comprises a processing unit 3204, a system memory 3206 and a system bus 3208. The processing unit 3204 can be any of various commercially available processors, including without limitation AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; Intel® Celeron®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processing unit 3204.

The system bus 3208 provides an interface for system components including, but not limited to, the system memory 3206 to the processing unit 3204. The system bus 3208 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. Interface adapters may connect to the system bus 3208 via a slot architecture. Example slot architectures may include without limitation Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and the like.

The computing architecture 3200 may comprise or implement various articles of manufacture. An article of manufacture may comprise a computer-readable storage medium to store logic. Examples of a computer-readable storage medium may include any tangible media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of logic may include executable computer program instructions implemented using any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, and the like. Embodiments may also be at least partly implemented as instructions contained in or on a non-transitory computer-readable medium, which may be read and executed by one or more processors to enable performance of the operations described herein.

The system memory 3206 may include various types of computer-readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory, solid state drives (SSD)), and any other type of storage media suitable for storing information. In the illustrated embodiment shown in FIG. 32, the system memory 3206 can include non-volatile memory 3210 and/or volatile memory 3212. A basic input/output system (BIOS) can be stored in the non-volatile memory 3210.

The computer 3202 may include various types of computer-readable storage media in the form of one or more lower speed memory units, including an internal (or external) hard disk drive (HDD) 3214, a magnetic floppy disk drive (FDD) 3216 to read from or write to a removable magnetic disk 3218, and an optical disk drive 3220 to read from or write to a removable optical disk 3222 (e.g., a CD-ROM or DVD). The HDD 3214, FDD 3216 and optical disk drive 3220 can be connected to the system bus 3208 by a HDD interface 3224, an FDD interface 3226 and an optical drive interface 3228, respectively. The HDD interface 3224 for external drive implementations can include at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

The drives and associated computer-readable media provide volatile and/or nonvolatile storage of data, data structures, computer-executable instructions, and so forth. For example, a number of program modules can be stored in the drives and memory units 3210, 3212, including an operating system 3230, one or more application programs 3232, other program modules 3234, and program data 3236. In one embodiment, the one or more application programs 3232, other program modules 3234, and program data 3236 can include, for example, the various applications and/or components of the automated statistical test system 100.

A user can enter commands and information into the computer 3202 through one or more wire/wireless input devices, for example, a keyboard 3238 and a pointing device, such as a mouse 3240. Other input devices may include microphones, infra-red (IR) remote controls, radio-frequency (RF) remote controls, game pads, stylus pens, card readers, dongles, finger print readers, gloves, graphics tablets, joysticks, keyboards, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, trackpads, sensors, styluses, and the like. These and other input devices are often connected to the processing unit 3204 through an input device interface 3242 that is coupled to the system bus 3208, but can be connected by other interfaces such as a parallel port, IEEE 1394 serial port, a game port, a USB port, an IR interface, and so forth.

A monitor 3244 or other type of display device is also connected to the system bus 3208 via an interface, such as a video adaptor 3246. The monitor 3244 may be internal or external to the computer 3202. In addition to the monitor 3244, a computer typically includes other peripheral output devices, such as speakers, printers, and so forth.

The computer 3202 may operate in a networked environment using logical connections via wire and/or wireless communications to one or more remote computers, such as a remote computer 3248. The remote computer 3248 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computer 3202, although, for purposes of brevity, only a memory/storage device 3250 is illustrated. The logical connections depicted include wire/wireless connectivity to a local area network (LAN) 3252 and/or larger networks, for example, a wide area network (WAN) 3254. Such LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, for example, the Internet.

When used in a LAN networking environment, the computer 3202 is connected to the LAN 3252 through a wire and/or wireless communication network interface or adaptor 3256. The adaptor 3256 can facilitate wire and/or wireless communications to the LAN 3252, which may also include a wireless access point disposed thereon for communicating with the wireless functionality of the adaptor 3256.

When used in a WAN networking environment, the computer 3202 can include a modem 3258, or is connected to a communications server on the WAN 3254, or has other means for establishing communications over the WAN 3254, such as by way of the Internet. The modem 3258, which can be internal or external and a wire and/or wireless device, connects to the system bus 3208 via the input device interface 3242. In a networked environment, program modules depicted relative to the computer 3202, or portions thereof, can be stored in the remote memory/storage device 3250. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.

The computer 3202 is operable to communicate with wire and wireless devices or entities using the IEEE 802 family of standards, such as wireless devices operatively disposed in wireless communication (e.g., IEEE 802.11 over-the-air modulation techniques). This includes at least Wi-Fi (or Wireless Fidelity), WiMax, and Bluetooth™ wireless technologies, among others. Thus, the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices. Wi-Fi networks use radio technologies called IEEE 802.11x (a, b, g, n, etc.) to provide secure, reliable, fast wireless connectivity. A Wi-Fi network can be used to connect computers to each other, to the Internet, and to wire networks (which use IEEE 802.3-related media and functions).

FIG. 33 illustrates a block diagram of an exemplary communications architecture 3300 suitable for implementing various embodiments as previously described. The communications architecture 3300 includes various common communications elements, such as a transmitter, receiver, transceiver, radio, network interface, baseband processor, antenna, amplifiers, filters, power supplies, and so forth. The embodiments, however, are not limited to implementation by the communications architecture 3300.

As shown in FIG. 33, the communications architecture 3300 includes one or more clients 3302 and servers 3304. The clients 3302 may implement the client device 3110. The servers 3304 may implement the server device 3150. The clients 3302 and the servers 3304 are operatively connected to one or more respective client data stores 3308 and server data stores 3310 that can be employed to store information local to the respective clients 3302 and servers 3304, such as cookies and/or associated contextual information.

The clients 3302 and the servers 3304 may communicate information between each other using a communications framework 3306. The communications framework 3306 may implement any well-known communications techniques and protocols. The communications framework 3306 may be implemented as a packet-switched network (e.g., public networks such as the Internet, private networks such as an enterprise intranet, and so forth), a circuit-switched network (e.g., the public switched telephone network), or a combination of a packet-switched network and a circuit-switched network (with suitable gateways and translators).

The communications framework 3306 may implement various network interfaces arranged to accept, communicate, and connect to a communications network. A network interface may be regarded as a specialized form of an input output interface. Network interfaces may employ connection protocols including without limitation direct connect, Ethernet (e.g., thick, thin, twisted pair 10/100/1000 Base T, and the like), token ring, wireless network interfaces, cellular network interfaces, IEEE 802.11a-x network interfaces, IEEE 802.16 network interfaces, IEEE 802.20 network interfaces, and the like. Further, multiple network interfaces may be used to engage with various communications network types. For example, multiple network interfaces may be employed to allow for the communication over broadcast, multicast, and unicast networks. Should processing requirements dictate a greater amount of speed and capacity, distributed network controller architectures may similarly be employed to pool, load balance, and otherwise increase the communicative bandwidth required by clients 3302 and the servers 3304. A communications network may be any one or a combination of wired and/or wireless networks including without limitation a direct interconnection, a secured custom connection, a private network (e.g., an enterprise intranet), a public network (e.g., the Internet), a Personal Area Network (PAN), a Local Area Network (LAN), a Metropolitan Area Network (MAN), an Operating Missions as Nodes on the Internet (OMNI), a Wide Area Network (WAN), a wireless network, a cellular network, and other communications networks.

FIG. 34 illustrates an embodiment of a storage medium 3400. The storage medium 3400 may comprise an article of manufacture. In one embodiment, the storage medium 3400 may comprise any non-transitory, physical, or hardware computer readable medium or machine readable medium, such as an optical, magnetic or semiconductor storage. The storage medium may store various types of computer executable instructions 3402, such as instructions to implement one or more of the logic flows as described herein. Examples of a computer readable or machine readable storage medium may include any tangible media capable of storing electronic data, including physical memory, hardware memory, volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of computer executable instructions may include any suitable type of code, such as assembly code, source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, compressed code, uncompressed code, and the like. The embodiments are not limited in this context.

The computer executable instructions 3402 may be implemented using one or more different types of programming languages. A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms. Many programming languages have computation specified in an imperative form (e.g., as a sequence of operations to perform), while other languages utilize other forms of program specification such as the declarative form (e.g., the desired result is specified, not how to achieve it). The description of a programming language is usually split into the two components of syntax (form) and semantics (meaning). Some languages are defined by a specification document (e.g., the C programming language is specified by an ISO Standard), while other languages (e.g., Perl) have a dominant implementation that is treated as a reference.

In one embodiment, for example, the computer executable instructions3402 may be implemented in a specific programming language as developedby SAS Institute, Inc., Cary, N.C. For instance, the computer executableinstructions 3402 may be implemented in a procedure referred to asHPSIMULATE, which is a procedure suitable for execution within a SASprogramming language and computing environment. In such embodiments, thecomputer executable instructions 3402 may follow syntax and semanticsassociated with HPSIMULATE. However, embodiments are not limited toHPSIMULATE, and further, do not need to necessarily follow the syntaxand semantics associated with HPSIMULATE. Embodiments are not limited toa particular type of programming language.

The HPSIMULATE procedure dynamically loads a TK-extension to performstatistical simulation and other tasks, such as post-processing,optimization, and other tasks. In one embodiment, the HPSIMULATEprocedure may perform statistical simulation in distributed computingand multi-thread environment.

The HPSIMULATE procedure may have a syntax as follows:

  PROC HPSIMULATE DATA = SAS-data-set
      DATADIST = ( COPYONGENERAL | COPYTONODES | ROUNDROBIN | DEFAULT | INSLICES | COLUMNWISE | COLUMNWISEBY )
      NAMELEN <= number>
      NOCLPRINT <= number>
      DEBUG$ <= number>
      NTRIES = number
      NOPRINT;
  MODULE
      EXT = name
      TASK = number
      DEPENDENT | CONTROLPARALLEL
      TASKPARMV | VARPARM | VAR = ( variable-list )
      TASKPARMN | NUMBERPARM | TASKPARM = ( number-list )
      TASKPARMS | STRINGPARM = ( quoted-string-list )
      NAME = name;
  OUTPUT
      OUT | OUT1 = SAS-data-set
      OUT2 = SAS-data-set
      OUT3 = SAS-data-set
      OUT4 = SAS-data-set
      OUT5 = SAS-data-set
      OUT6 = SAS-data-set
      OUT7 = SAS-data-set
      OUT8 = SAS-data-set
      OUT9 = SAS-data-set
      REG | REGSTART = number;
  PERFORMANCE
      NODES = number
      NTHREADS = number.

The options in gray font are either not required to run the HPSIMULATE procedure or reserved for future use.

A set of statements and options used with the HPSIMULATE procedure are summarized in the following Table 6:

TABLE 6

Description                                           Statement     Option

Data Set Options
  Specify the input data set                          HPSIMULATE    DATA=
  Specify how the data are distributed on grid        HPSIMULATE    DATADIST=
  Write results to an output data set                 OUTPUT        OUT=

Grid Control Options
  Specify the number of captains                      PERFORMANCE   NODES=
  Specify the number of threads                       PERFORMANCE   NTHREADS=

Task Control Options
  Specify the TK-extension to execute the tasks       MODULE        EXT=
  Specify the task ID to be executed                  MODULE        TASK=
  Specify whether the task needs to control           MODULE        DEPENDENT
    communication between threads and between nodes
  Specify the variable names in input data set        MODULE        TASKPARMV=
  Specify the number parameters                       MODULE        TASKPARMN=
  Specify the string parameters                       MODULE        TASKPARMS=
  Specify the name of the module                      MODULE        NAME=

The HPSIMULATE procedure may use the following statement:

    PROC HPSIMULATE options.

The HPSIMULATE statement may use a first option, as follows:

    DATA=SAS-data-set

The DATA option specifies the input data set containing parameters for simulation or data for other tasks. If the DATA option is not specified, PROC HPSIMULATE uses the most recently created SAS data set.

The HPSIMULATE statement may use a second option, as follows:

    DATADIST=(options)

The DATADIST option specifies how data is distributed on a distributed computing system. It accepts the values shown in Table 7, as follows:

TABLE 7

Option          Description

COPYONGENERAL   Make a copy on the general node.
COPYTONODES     Make a copy of the data set to each captain so that each captain has all data. This is the default option.
ROUNDROBIN      Distribute the data to captains row-wise according to the round-robin rule.
DEFAULT         Distribute the data to captains row-wise according to the first-come-first-serve rule.
INSLICES        Distribute the data to captains in slices.
COLUMNWISE      Distribute the data to captains column-wise and evenly.
COLUMNWISEBY    Distribute the data to captains column-wise according to the groups defined in the first row of data: (1) the group ID must be an integer; (2) a negative ID indicates that the corresponding columns need not be distributed; and (3) a zero ID indicates that the columns must be distributed to all captains.
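By way of illustration only, the row-wise and column-wise rules of Table 7 may be sketched in Python as follows. This is a minimal sketch and not the HPSIMULATE implementation. The function names, the zero-based captain numbering, and the mapping of a positive group ID g to captain (g - 1) mod n are assumptions made for the example; Table 7 states only the three constraints on group IDs.

```python
def round_robin(rows, n_captains):
    """ROUNDROBIN: row i goes to captain i mod n_captains (assumed numbering)."""
    shares = [[] for _ in range(n_captains)]
    for i, row in enumerate(rows):
        shares[i % n_captains].append(row)
    return shares


def columnwise_by(group_ids, n_captains):
    """COLUMNWISEBY: assign column indices to captains from the group IDs
    found in the first row of the data.

    Rules taken from Table 7: IDs are integers; a negative ID means the
    column is not distributed; a zero ID means the column goes to every
    captain. Assumption (not in the source): a positive ID g maps to
    captain (g - 1) % n_captains.
    """
    shares = [[] for _ in range(n_captains)]
    for col, gid in enumerate(group_ids):
        if int(gid) != gid:
            raise ValueError("group ID must be an integer")
        if gid < 0:
            continue                      # column is not distributed
        if gid == 0:
            for share in shares:          # column goes to all captains
                share.append(col)
        else:
            shares[(int(gid) - 1) % n_captains].append(col)
    return shares


if __name__ == "__main__":
    print(round_robin(list(range(7)), 3))
    print(columnwise_by([0, 1, 1, 2, -1, 3], 3))
```

In the second call, column 0 carries a zero ID and is therefore copied to every captain, while column 4 carries a negative ID and is left undistributed.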

The HPSIMULATE procedure may have a module statement as follows:

    MODULE options.

The MODULE statement specifies the TK-extension and parameters for the task to be executed. The MODULE statement may use seven options, as follows:

    EXT=name
    TASK=number
    DEPENDENT|CONTROLPARALLEL
    TASKPARMV|VARPARM|VAR=(variable-list)
    TASKPARMN|NUMBERPARM|TASKPARM=(number-list)
    TASKPARMS|STRINGPARM=(quoted-string-list)
    NAME=name

The EXT option specifies the name of the TK-extension to execute the task. The TK-extension can focus on the task-oriented calculation since the data I/O, communication between client and grid and on grid, and multi-threading are all left to the HPSIMULATE procedure. The TK-extension is dynamically loaded in the procedure. The EXT= option must be specified. The TK-extension must follow a protocol defined in a virtual TK-extension, which includes the structures of instance and factory of functions; in other words, any user-specified TK-extension is the “child” of that virtual TK-extension, which is called TKVRT and introduced later in the Details section.

The TASK option specifies the task ID to be executed. The TK-extension understands the task ID and executes the right task. By default, the TASK= option is set to zero.

The DEPENDENT|CONTROLPARALLEL option specifies whether the task needs to control communication between threads and between nodes.

The TASKPARMV|VARPARM|VAR option specifies the variables in the input data set. For example, if the input data set contains parameters for the simulation, the variables are the names of parameters; if the input data set is for post-processing, the variables define the columns of data to be dealt with. The TASKPARMV option should be specified. If an input data set is not needed, a dummy data set and a dummy variable name may be specified.

The TASKPARMN|NUMBERPARM|TASKPARM option specifies the number parameters for the task. Examples include the number of simulations, the random seed to start, and the optimization grid.

The TASKPARMS|STRINGPARM option specifies the string parameters for the task. Examples include the output folder and the output file name or its prefix and suffix.

The NAME option specifies a name of the module.

The HPSIMULATE procedure may include an output statement, as follows:

    OUTPUT OUT=SAS-data-set

The OUTPUT statement creates an output SAS data set as specified by the following OUT option:

    OUT=SAS-data-set

The OUT option names the output SAS data set containing the task-dependent results, which might be simulated statistics or the quantiles.

The HPSIMULATE procedure may include a performance statement, called PERFORMANCE. The PERFORMANCE statement is a common statement supported in a high performance architecture (HPA) bridge. Only some options used in the HPSIMULATE procedure are listed as follows:

    NODES=number

The NODES option specifies the number of captains. If NODES=0 is specified, the procedure is executed on the client side and no computers of the distributed computing environment are involved.

    NTHREADS=number

The NTHREADS option specifies the number of threads to be used in each computer.
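By way of illustration only, the combined effect of the NODES and NTHREADS options may be pictured with the following Python sketch, which divides a requested number of simulations among workers. The source does not state how work is divided; the even split, the function name split_simulations, and the treatment of NODES=0 as a single client-side worker group are all assumptions made for the example.

```python
def split_simulations(n_sims, n_nodes, n_threads):
    """Return {(captain_id, thread_id): simulation count}.

    Assumptions (not from the source): work is split as evenly as
    possible, and NODES=0 (client-side execution) is modeled as one
    worker group with captain_id 0.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    n_captains = max(n_nodes, 1)
    workers = [(c, t) for c in range(n_captains) for t in range(n_threads)]
    base, extra = divmod(n_sims, len(workers))
    # The first `extra` workers each take one additional simulation.
    return {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}


if __name__ == "__main__":
    print(split_simulations(10, 2, 2))   # two captains, two threads each
    print(split_simulations(5, 0, 2))    # NODES=0: client side only
```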

The HPSIMULATE procedure is based, in part, on the HPLOGISTICS procedure. The framework of the HPLOGISTICS procedure may implement all data input/output, communication between client computers 602 and the distributed computing system 610, or general and captain nodes of the distributed computing system 610, and multi-threading details. A framework extended on the framework of the HPLOGISTICS procedure is shown in FIGS. 9-11. The framework is flexible enough to support both simple and complex algorithms. In this manner, a client application may plug in its own tasks, such as simulation or estimation. A user's TK-extension should follow a protocol defined in a virtual TK-extension, which includes structures of instance and factory of functions. In other words, any user-specified TK-extension is a “child” of that virtual TK-extension, which is called TKVRT.

As noted, the user-specified TK-extension should be a “child” of the virtual TK-extension TKVRT. The TKVRT defines the following public structures related to input parameters and output results:

struct TKVRT_COLUMN  /* Column name element */
{
    int type;
    int namelen;
    char name[TKVRT_MAXNAME];
    tkvrtColumnPtr next;
};

struct TKVRT_DATA  /* Matrix in memory or utility file on disk with column names */
{
    TKBoolean QinMemory;
    int64_t nRow;
    int64_t curRow;
    int64_t nColumn;
    tkvrtColumnPtr colHead;
    tkvrtColumnPtr colTail;
    double *mat;
    tkrecUtFilePtr fid;
    TKPoolh Pool;
};

struct TKVRT_PARMS  /* Parameters */
{
    long nCaptains;                 /* is the number of captains */
    long captainID;                 /* is the current captain ID */
    long nThreads;                  /* is the number of threads */
    long threadID;                  /* is the current thread ID */
    long task;                      /* is the task id */
    char taskFlag[5];               /* is the task flag */
    long nTaskParm;                 /* is the number of input number parameters */
    double *taskParmList;           /* is the list of input number parameters */
    long nTaskParmStr;              /* is the number of input string parameters */
    char **taskParmStrList;         /* is the list of input string parameters */
    long *taskParmStrLenList;       /* is the list of the length of input string parameters */
    long nInputData;                /* is the number of input data sets */
    tkvrtDataPtr inputDataList;     /* is the list of input data sets */
    long nOutputParm;               /* is the number of output number parameters */
    int64_t sOutputParm;            /* is the size of allocated memory for output number parameters */
    double *outputParmList;         /* is the list of output number parameters */
    long nOutputInt64Parm;          /* is the number of output integer parameters */
    int64_t sOutputInt64Parm;       /* is the size of allocated memory for output integer parameters */
    int64_t *outputInt64ParmList;   /* is the list of output integer parameters */
    long nOutputParmStr;            /* is the number of output string parameters */
    char **outputParmStrList;       /* is the list of output string parameters */
    long *outputParmStrLenList;     /* is the list of the length of output string parameters */
    long nOutputData;               /* is the number of output data sets */
    tkvrtDataPtr outputDataList;    /* is the list of output data sets */
    TKPoolh taskPool;               /* is the memory Pool */
    TKMemPtr userPtr;               /* is the pointer to anything else */
    TKMemPtr userPtr1;              /* is the pointer to anything else */
    TKMemPtr userPtr2;              /* is the pointer to anything else */
    TKMemPtr userPtr3;              /* is the pointer to anything else */
    TKMemPtr userPtr4;              /* is the pointer to anything else */
};

The function Set up Thread Work(.) in tksimt.c may provide details on how the parameter structures are initialized.

The TKVRT also declares the following public functions:

  TKStatus (*ValueGet)(tkvrtInstPtr, int, TKMemPtr, TKMemSize *);
  TKStatus (*ValueSet)(tkvrtInstPtr, int, TKMemPtr);
  TKStatus (*DestroyInstance)(tkvrtInstPtr *);
  TKStatus (*ResetInstance)(tkvrtInstPtr);
  TKStatus (*Initialize)(tkvrtInstPtr);
  TKStatus (*Analyze)(tkvrtInstPtr);
  TKStatus (*Summarize)(tkvrtInstPtr);
  TKStatus (*GridInitialize)(tkvrtInstPtr);
  TKStatus (*GridSummarize)(tkvrtInstPtr, TKMemPtr);
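By way of illustration only, the parent/child relationship between TKVRT and a user-specified TK-extension may be pictured with the following Python sketch. It is not the TKVRT implementation. The class names VirtualExtension and MeanExtension, the helper run_task, and the toy task are hypothetical; the method names merely echo the Initialize, Analyze, Summarize, and GridSummarize entries listed above. The order in which the driver calls them is an assumption, since the listing gives the functions but not their sequencing.

```python
from abc import ABC, abstractmethod
import random


class VirtualExtension(ABC):
    """Hypothetical stand-in for the virtual parent: it fixes the entry
    points that every child extension must provide."""

    def __init__(self, task_parms, thread_id):
        self.task_parms = task_parms   # loosely analogous to taskParmList
        self.thread_id = thread_id     # loosely analogous to threadID
        self.output = []               # loosely analogous to outputParmList

    @abstractmethod
    def initialize(self):
        """Prepare per-thread state."""

    @abstractmethod
    def analyze(self):
        """Do the task-oriented calculation."""

    @abstractmethod
    def summarize(self):
        """Store this thread's results in self.output."""

    @staticmethod
    @abstractmethod
    def grid_summarize(per_thread_outputs):
        """Combine the outputs of all threads."""


class MeanExtension(VirtualExtension):
    """Toy child: each thread simulates normal draws and reports (sum, count)."""

    def initialize(self):
        n_sims, seed = self.task_parms
        self.n_sims = int(n_sims)
        # Offset the seed by thread ID so threads draw different streams.
        self.rng = random.Random(int(seed) + self.thread_id)

    def analyze(self):
        self.draws = [self.rng.gauss(0.0, 1.0) for _ in range(self.n_sims)]

    def summarize(self):
        self.output = [sum(self.draws), len(self.draws)]

    @staticmethod
    def grid_summarize(per_thread_outputs):
        total = sum(o[0] for o in per_thread_outputs)
        count = sum(o[1] for o in per_thread_outputs)
        return total / count


def run_task(ext_class, task_parms, n_threads):
    """Assumed call order; threads are run sequentially for simplicity."""
    outputs = []
    for thread_id in range(n_threads):
        inst = ext_class(task_parms, thread_id)
        inst.initialize()
        inst.analyze()
        inst.summarize()
        outputs.append(inst.output)
    return ext_class.grid_summarize(outputs)


if __name__ == "__main__":
    print(run_task(MeanExtension, [1000, 42], 4))
```

The driver depends only on the parent's interface, which mirrors the point made in the description that the child can concentrate on the task-oriented calculation while the framework handles everything else.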

The TKVRT is implemented in tkvrt.h, tkvrtmem.h, tkvrtp.h, and tkvrt.c. An example of a child of TKVRT is TKSCBP, which is implemented in tkscbp.h, tkscbpp.h, and tkscbp.c, and is used to simulate the statistics of multiple structural change tests and generate the quantiles for constructing the empirical CDFs.
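By way of illustration only, the use of simulated quantiles to obtain a p-value through interpolation may be sketched in Python as follows. This is not the TKSCBP code. The function names are hypothetical, and three choices are assumptions made for the example: the test is treated as upper-tailed (large statistics count against the null hypothesis), the empirical CDF is interpolated linearly between stored quantiles, and statistics outside the simulated range are clamped to the nearest stored probability.

```python
import bisect
import random


def empirical_quantiles(simulated_stats, probs):
    """Quantiles of the simulated statistics at each probability in probs,
    using linear interpolation between order statistics."""
    xs = sorted(simulated_stats)
    n = len(xs)
    out = []
    for p in probs:
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return out


def p_value(stat, probs, quantiles):
    """Upper-tail p-value 1 - F(stat), where F is interpolated linearly
    between the stored (quantile, probability) pairs."""
    if stat <= quantiles[0]:
        return 1.0 - probs[0]
    if stat >= quantiles[-1]:
        return 1.0 - probs[-1]
    # Locate j such that quantiles[j - 1] <= stat < quantiles[j].
    j = bisect.bisect_right(quantiles, stat)
    q0, q1 = quantiles[j - 1], quantiles[j]
    p0, p1 = probs[j - 1], probs[j]
    cdf = p0 + (p1 - p0) * (stat - q0) / (q1 - q0)
    return 1.0 - cdf


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy null distribution: maximum absolute value of five normal draws.
    sims = [max(abs(rng.gauss(0.0, 1.0)) for _ in range(5)) for _ in range(20000)]
    probs = [i / 100 for i in range(101)]
    quantiles = empirical_quantiles(sims, probs)
    print(p_value(2.5, probs, quantiles))
```

Storing a fixed grid of quantiles rather than every simulated statistic keeps the table small, at the cost of interpolation error between grid points; a finer probability grid in the tails would reduce that error where p-values matter most.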

Some embodiments may be described using the expression “one embodiment” or “an embodiment” along with their derivatives. These terms mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment. Further, some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, some embodiments may be described using the terms “connected” and/or “coupled” to indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.

It is emphasized that the Abstract of the Disclosure is provided to allow a reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein,” respectively. Moreover, the terms “first,” “second,” “third,” and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects in all situations.

Some systems may use an open-source framework for storing and analyzing big data in a distributed computing environment. For example, some systems may use Hadoop® for applications in which the simulated functions depend on given fixed data that are supplied externally to the algorithm, and in which these data can be read from distributed file systems, such as Hadoop®. This could apply, for example, if subsets of the data on different nodes correspond to different cases to be simulated. In that case, different nodes can do the simulations for the subcases corresponding to the data that they read locally, without need to pass data across the network. To help make that process work, the system could adopt a map-reduce-like pattern for controlling which nodes do which simulations.
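By way of illustration only, a map-reduce-like pattern of this kind may be sketched in Python as follows. The sketch does not use Hadoop® or any distributed file system; nodes are modeled as plain dictionaries holding their local cases. The function names, the node and case labels, and the toy resampled-mean statistic are all hypothetical.

```python
import random


def simulate_case(case_id, data, n_sims, seed):
    """Simulate a statistic n_sims times for one case, using only the
    node-local data for that case (here, a toy resampled mean)."""
    rng = random.Random(f"{seed}-{case_id}")   # one stream per case
    return [
        sum(rng.choice(data) for _ in data) / len(data)
        for _ in range(n_sims)
    ]


def map_on_node(local_cases, n_sims, seed):
    """Map step: a node simulates only the cases whose data it holds."""
    return {
        case_id: simulate_case(case_id, data, n_sims, seed)
        for case_id, data in local_cases.items()
    }


def reduce_results(per_node_results):
    """Reduce step: merge per-node results; only results cross the network."""
    merged = {}
    for node_result in per_node_results:
        merged.update(node_result)
    return merged


if __name__ == "__main__":
    nodes = {
        "node0": {"A": [1.0, 2.0, 3.0]},
        "node1": {"B": [10.0, 12.0], "C": [5.0]},
    }
    results = reduce_results(map_on_node(cases, 100, 7) for cases in nodes.values())
    print({case_id: len(sims) for case_id, sims in results.items()})
```

Because each case lives on exactly one node in this sketch, the reduce step is a simple merge; if a case were split across nodes, the reduce step would need to combine partial results instead.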

Some systems may use cloud computing, which can enable ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Some grid systems may be implemented as a multi-node cluster. Some systems may use a massively parallel processing (MPP) database architecture. Some systems may be used in conjunction with complex analytics (e.g., high-performance analytics, complex business analytics, and/or big data analytics) to solve complex problems quickly.

What has been described above includes examples of the disclosed architecture. It is, of course, not possible to describe every conceivable combination of components and/or methodologies, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. Accordingly, the described architecture is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.

1. An article of manufacture comprising a non-transitory computer-readable storage medium containing instructions that, when executed, cause a system to: receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, the statistics of the statistical test to follow a probability distribution; receive a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon; generate statistics for the statistical test using the real data set on the parameter vector; generate the approximate probability distribution of the computational representation on the parameter vector; and generate a set of statistical significance values for the statistics through interpolation using the approximate probability distribution, the set of statistical significance values comprising one or more p-values, each p-value to represent a probability of obtaining a given test statistic from the real data set.

2. The article of claim 1, the approximate probability distribution to comprise an empirical cumulative distribution function (CDF), the empirical CDF to have a first level of precision relative to the probability distribution based on a sample size of the statistics.

3. The article of claim 1, further comprising instructions that, when executed, enable a system to present the p-values in a user interface view on an electronic display.

4. The article of claim 1, further comprising instructions that, when executed, enable a system to compare a p-value to a defined threshold value.

5. The article of claim 1, further comprising instructions that, when executed, enable a system to determine whether a null hypothesis for the statistical test is rejected based on a comparison of a p-value to a defined threshold value.

6. The article of claim 5, further comprising instructions that, when executed, enable a system to determine whether there is a relationship between two measured phenomena when the null hypothesis is rejected.

7. The article of claim 5, further comprising instructions that, when executed, enable a system to determine whether a correct hypothesis for the statistical test is based on a logical complement of the null hypothesis when the null hypothesis is rejected.

8. The article of claim 1, the computational representation to comprise a software component arranged for execution by processor circuitry to generate the approximate probability distribution for the statistical test when testing a real data set.

9. The article of claim 1, the computational representation to comprise source code or executable code.

10. The article of claim 1, the computational representation to comprise a dynamic-link library (DLL).

11. The article of claim 1, the parameter vector to comprise a point in a grid of points used for interpolation.

12. The article of claim 1, the probability distribution having a known form.

13. The article of claim 1, the probability distribution having an unknown form.
14. An apparatus, comprising: processor circuitry; a data handler component operative on the processor circuitry to receive a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon; a statistical test component operative on the processor circuitry to receive a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, the statistics of the statistical test to follow a probability distribution, generate statistics for the statistical test using the real data set, generate the approximate probability distribution of the computational representation; and a significance generator component operative on the processor circuitry to generate a set of statistical significance values for the statistics through interpolation using the approximate probability distribution, the set of statistical significance values comprising one or more p-values, each p-value to represent a probability of obtaining a given test statistic from the real data set.

15. The apparatus of claim 14, the approximate probability distribution to comprise an empirical cumulative distribution function (CDF), the empirical CDF to have a first level of precision relative to the probability distribution based on a sample size of the statistics.

16. The apparatus of claim 14, the significance generator component to present the p-values in a user interface view on an electronic display.

17. The apparatus of claim 14, the significance generator component to compare a p-value to a defined threshold value.

18. The apparatus of claim 14, the significance generator component to determine whether a null hypothesis for the statistical test is rejected based on a comparison of a p-value to a defined threshold value.

19. The apparatus of claim 18, further comprising instructions that, when executed, enable a system to determine there is a relationship between two measured phenomena when the null hypothesis is rejected.

20. The apparatus of claim 18, further comprising instructions that, when executed, enable a system to determine a correct hypothesis for the statistical test is based on a logical complement of the null hypothesis when the null hypothesis is rejected.

21. The apparatus of claim 14, the computational representation to comprise a software component arranged for execution by processor circuitry to generate the approximate probability distribution for the statistical test when testing a real data set.

22. The apparatus of claim 14, the computational representation to comprise source code or executable code.

23. The apparatus of claim 14, the computational representation to comprise a dynamic-link library (DLL).

24. The apparatus of claim 14, the parameter vector to comprise a point in a grid of points used for interpolation.

25. The apparatus of claim 14, the probability distribution having a known form.

26. The apparatus of claim 14, the probability distribution having an unknown form.
27. A computer-implemented method, comprising: receiving, by circuitry, a computational representation arranged to generate an approximate probability distribution for statistics of a statistical test based on a parameter vector, the statistics of the statistical test to follow a probability distribution; receiving, by circuitry, a real data set from a client device, the real data set to comprise data representing at least one measurable phenomenon; generating, by circuitry, statistics for the statistical test using the real data set; generating, by circuitry, the approximate probability distribution of the computational representation; and generating, by circuitry, a set of statistical significance values for the statistics through interpolation using the approximate probability distribution, the set of statistical significance values comprising one or more p-values, each p-value to represent a probability of obtaining a given test statistic from the real data set.

28. The computer-implemented method of claim 27, the approximate probability distribution to comprise an empirical cumulative distribution function (CDF), the empirical CDF to have a first level of precision relative to the probability distribution based on a sample size of the statistics.

29. The computer-implemented method of claim 27, comprising presenting the p-values in a user interface view on an electronic display.

30. The computer-implemented method of claim 27, comprising: comparing a p-value to a defined threshold value; and determining whether a null hypothesis for the statistical test is rejected based on results of the comparison.

31. The computer-implemented method of claim 30, further comprising instructions that, when executed, enable a system to determine there is a relationship between two measured phenomena when the null hypothesis is rejected.

32. The computer-implemented method of claim 30, further comprising instructions that, when executed, enable a system to determine a correct hypothesis for the statistical test is based on a logical complement of the null hypothesis when the null hypothesis is rejected.

33. The computer-implemented method of claim 27, the computational representation to comprise a software component arranged for execution by processor circuitry to generate the approximate probability distribution for the statistical test when testing a real data set.

34. The computer-implemented method of claim 27, the computational representation to comprise source code or executable code.

35. The computer-implemented method of claim 27, the computational representation to comprise a dynamic-link library (DLL).

36. The computer-implemented method of claim 27, the parameter vector to comprise a point in a grid of points used for interpolation.

37. The computer-implemented method of claim 27, the probability distribution having a known form.

38. The computer-implemented method of claim 27, the probability distribution having an unknown form.